Reputation: 1
I have a table called "transaction" in Hive which is partitioned on a column called "DS" which will have data like "2018-05-05", "2018-05-09", "2018-05-10" and so on
This table is populated overnight for the day which got completed. At any point, the table will have data till previous day
When I query trasaction table like
SELECT COUNT(*) FROM trasaction WHERE DS >= "current date";
I get
0 rows - which is correct because data has not been loaded yet for current and future dates
When I run below query
SELECT DISTINCT DS FROM trasaction WHERE DS >= "current date";
I get
2018-05-09
2018-05-10
2018-05-11
2018-05-12
2018-05-13
...
2018-08-30
I have checked the HDFS location and partition folders already exists for future dates as well. I am guessing the people who do ETL, have already added partition for future dates using
ALTER TABLE trasaction ADD PARTITION (DS = '2018-05-13')
and so on
Am I doing wrong here? Even if the output of both queries are correct, what is the reason for above output?
Upvotes: 0
Views: 9792
Reputation: 4754
The observed output is due to Hive partitions being present for the future, but data files have not yet been populated for them in HDFS.
Try these commands to get additional insight on partitions with and without data. The illustration assumes 3 partitions were created, and the 3rd one does not yet have data file.
Get the partitions created for the table
hive> show partitions transaction;
-- output
ds=2018-05-13
ds=2018-05-14
ds=2018-05-15
Get the record count (for partitions with data):
-- Only partitions having data will be returned. In this case, partitions '2018-05-13' and '2018-05-14' have data. Hence, partition '2018-05-15' is not included.
select ds, count(*)
from transaction
where ds > '2018-05-12'
group by ds;
-- output
2018-05-13 3
2018-05-14 5
Upvotes: 2