Reputation: 739
I have an external table in hive and pointing to HDFS location. By mistake I have ran the job to load the data into HDFS two times.
Even after deleting the duplicate file from HDFS hive is showing the data count two times(i.e. including deleted duplicate data file count).
select count(*) from tbl_name -- returns double time
But ,
select count(col_name) from tbl_name -- returns actual count.
Same table when I tried from Impala after
INVALIDATE METADATA
I could see only data count which is available in HDFS(not duplicate).
How can hive give count as double even after deleting from physical location(hdfs) , does it read from statistics?
Upvotes: 1
Views: 1491
Reputation: 38290
Hive is using statistics for computing cont(*)
. You deleted files manually (not using Hive) that is why the stats is wrong.
The solution is:
to switch-off statistics usage in such cases:
set hive.compute.query.using.stats=false;
to analyze table as you mention in your comment:
analyze table tbl_name partition(a,b,c) compute statistics;
Upvotes: 1