Reputation: 187
We have a data corruption issue on our Hadoop cluster. We have a managed Hive table which contains three years of data, partitioned by year.
The three queries below run fine without any issue:
select count(*) from tkt_hist where yr=2015
select count(*) from tkt_hist where yr=2016
select count(*) from tkt_hist where yr=2017
However, when we try to do a group by per year, the error below is shown:
Error while compiling statement: FAILED: SemanticException java.io.FileNotFoundException: File hdfs://ASIACELLHDP/apps/hive/warehouse/gprod1t_base.db/toll_tkt_hist_old/yr=2015/mn=01/dy=01 does not exist. [ERROR_STATUS]
Even a plain select will not work when we specify a year other than 2015.
-- this works fine
select * from tkt_hist where yr=2015 limit 10;
-- below throws the same error mentioned above
select * from tkt_hist where yr=2016;
Upvotes: 0
Views: 1527
Reputation: 51
Try increasing the Java heap space (increase reducer memory if that doesn't work).
For example:
set mapreduce.map.java.opts=-Xmx15360m;
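If the failure is on the reduce side, the counterpart is the standard mapreduce.reduce.java.opts property; the container size usually has to be raised to stay above the heap, via mapreduce.reduce.memory.mb. A sketch, with illustrative values:

set mapreduce.reduce.java.opts=-Xmx15360m;
-- container size must exceed the heap; 18432 MB is an example value
set mapreduce.reduce.memory.mb=18432;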
Upvotes: 1
Reputation: 1035
You will have to drop the partitions manually, because MSCK REPAIR TABLE
only adds partitions and doesn't remove existing ones.
You will have to iterate through the list of corrupt partitions. For internal tables you have to be specific, since dropping a partition also deletes the underlying physical files.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr=2015, mn='01', dy='01');
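To build that list, SHOW PARTITIONS prints every partition Hive currently has registered, one spec per line (the output lines shown here are illustrative):

SHOW PARTITIONS tkt_hist;
-- yr=2015/mn=01/dy=01
-- yr=2015/mn=01/dy=02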
You will need to do this for each partition. You could put the statements in a bash script and execute them with hive -e or beeline -e, passing a quoted query string. A sketch of that loop is shown below.
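A minimal bash sketch, assuming the corrupt partition specs sit one per line in a hypothetical file bad_partitions.txt and that beeline can reach HiveServer2 at a placeholder URL:

#!/usr/bin/env bash
# bad_partitions.txt holds one spec per line, e.g.: yr=2015, mn='01', dy='01'
while IFS= read -r part; do
  beeline -u "jdbc:hive2://localhost:10000" \
    -e "ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(${part});"
done < bad_partitions.txt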
If you are using an external table, it's much easier: remove all partitions and then repair the table.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr<>'', mn<>'', dy<>'');
Make sure to run the repair as the user that owns both the Hive database and the HDFS path.
MSCK REPAIR TABLE tkt_hist;
This should add the partition folders currently present in the table path without re-adding the invalid partitions.
Note: If your user isn't the owner of the directory, make sure you have write permissions and do your work in the hive client, as beeline requires full ownership rights to work.
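You can check who owns the warehouse path before running the repair; the path below is taken from the error message in the question:

hdfs dfs -ls /apps/hive/warehouse/gprod1t_base.db/toll_tkt_hist_old

The two name columns in the listing are the owner and group of each partition directory.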
Upvotes: 0