Prashanth G B
Prashanth G B

Reputation: 1

Record count for Hive partitioned table

I have a table called "transaction" in Hive which is partitioned on a column called "DS" which will have data like "2018-05-05", "2018-05-09", "2018-05-10" and so on

This table is populated overnight for the day which got completed. At any point, the table will have data till previous day

When I query trasaction table like

SELECT COUNT(*) FROM trasaction WHERE DS >= "current date";

I get

0 rows - which is correct because data has not been loaded yet for current and future dates

When I run below query

SELECT DISTINCT DS FROM trasaction WHERE DS >= "current date";

I get

2018-05-09
2018-05-10
2018-05-11
2018-05-12
2018-05-13
...
2018-08-30

I have checked the HDFS location and partition folders already exists for future dates as well. I am guessing the people who do ETL, have already added partition for future dates using

ALTER TABLE trasaction ADD PARTITION (DS = '2018-05-13') and so on

Am I doing wrong here? Even if the output of both queries are correct, what is the reason for above output?

Upvotes: 0

Views: 9792

Answers (1)

Jagrut Sharma
Jagrut Sharma

Reputation: 4754

The observed output is due to Hive partitions being present for the future, but data files have not yet been populated for them in HDFS.

Try these commands to get additional insight on partitions with and without data. The illustration assumes 3 partitions were created, and the 3rd one does not yet have data file.

Get the partitions created for the table

hive> show partitions transaction;

-- output 
ds=2018-05-13
ds=2018-05-14
ds=2018-05-15

Get the record count (for partitions with data):

-- Only partitions having data will be returned. In this case, partitions '2018-05-13' and '2018-05-14' have data. Hence, partition '2018-05-15' is not included.

select ds, count(*) 
from transaction 
where ds > '2018-05-12'
group by ds;

-- output
2018-05-13  3
2018-05-14  5

Upvotes: 2

Related Questions