Reputation: 47
I have a BIG table in Hive 0.13 - it has approx 250 GB of data per day. Per hour, it is, hence, approx, 10 GB of data. I have a BI Tool which would like to access this table's data on per day or per hour basis for which I need to test the queries which the BI tool would generate and run on Hive.
One of the queries, when BI is used for daily data for yesterday, looks like below:
select count(*)
from my_table
where
yyyy=year(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and mm=month(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and dd=day(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
;
My Table in Hive in MY_TABLE while YYYY, MM and DD are the partitioned columns in MY_TABLE. It is already stored in ORC Format.
The above query runs for very good amount of time, post which when I see the EXPLAIN EXTENDED output, I clearly see that it is doing a FULL TABLE SCAN of MY_TABLE irrespective of filter conditions.
How can we avoid this issue ?
Kindly advise.
Note again : Hive version is 0.13. We're in middle of an upgrade.
Thanks,
Suddhasatwa
Note:
The solution provided here (Why partitions elimination does not happen for this query?) is not applicable in my case, since I am using Hive 0.13 while CURRENT_DATE function is available only post Hive version 1.+.
Upvotes: 1
Views: 2541