omer

Reputation: 187

HDFS data corruption issue

We have a data corruption issue on our Hadoop cluster. We have a managed Hive table that contains three years of data partitioned by year.

The three queries below run fine without any issue:

select count(*) from tkt_hist where yr=2015;

select count(*) from tkt_hist where yr=2016;

select count(*) from tkt_hist where yr=2017;

However, when we try to group by year, the following error is shown:

Error while compiling statement: FAILED: SemanticException java.io.FileNotFoundException: File hdfs://ASIACELLHDP/apps/hive/warehouse/gprod1t_base.db/toll_tkt_hist_old/yr=2015/mn=01/dy=01 does not exist. [ERROR_STATUS]

Even a plain select fails when we specify a year other than 2015.

-- this works fine
select * from tkt_hist where yr=2015 limit 10;

-- this throws the same error as above
select * from tkt_hist where yr=2016;

Upvotes: 0

Views: 1527

Answers (2)

Ramesh babu M

Reputation: 51

Try increasing the Java heap space; if that doesn't help, increase the reducer memory as well.

For example:

set mapreduce.map.java.opts=-Xmx15360m;
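If the failure happens on the reduce side, the reduce-side equivalents would be along these lines (a sketch; the sizes mirror the map-side example above and are assumptions, not tuned values):

set mapreduce.reduce.java.opts=-Xmx15360m;
-- the container size must exceed the JVM heap; 18432 MB is an assumed value
set mapreduce.reduce.memory.mb=18432;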

Upvotes: 1

kfkhalili

Reputation: 1035

You will have to drop the partitions manually, because MSCK REPAIR TABLE only adds partitions; it doesn't remove existing ones.

You will have to iterate through the list of corrupt partitions. For internal (managed) tables you'll have to be specific, since dropping a partition deletes the underlying physical files.

ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr='2015', mn='01', dy='01');

You will need to do this for each corrupt partition, so you could put the statements in a bash script and execute them with hive -e or beeline -e, both of which accept a quoted query string.
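A minimal sketch of such a script, assuming the corrupt partition keys have been collected into a file beforehand (the name bad_partitions.txt and its one-triple-per-line layout are hypothetical):

#!/bin/bash
# Drop each corrupt partition listed in bad_partitions.txt (hypothetical file),
# one "yr mn dy" triple per line. JDBC_URL is a placeholder for your
# HiveServer2 connection string.
while read -r yr mn dy; do
  beeline -u "$JDBC_URL" -e "ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr='${yr}', mn='${mn}', dy='${dy}');"
done < bad_partitions.txt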

If you are using an external table, it's much easier: remove all the partitions and then repair the table.

ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr<>'', mn<>'', dy<>'');

Make sure to run the repair as the user that owns both the Hive database and the HDFS path.

MSCK REPAIR TABLE tkt_hist;

This should add the partition directories currently present in the table path without re-adding the invalid partitions.
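To verify, you can list the partitions Hive now tracks:

SHOW PARTITIONS tkt_hist;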

Note: If your user isn't the owner of the directory, ensure you have write permissions and do the work in the Hive CLI, since beeline requires full ownership rights to work.
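To check ownership, and fix it if needed, you could run something like the following against the table path from the error message (the hive:hadoop owner and group are assumptions; use whatever actually owns your warehouse):

# list the current owner and permissions of the table directory
hdfs dfs -ls /apps/hive/warehouse/gprod1t_base.db
# reassign ownership recursively (hive:hadoop is an assumed owner:group)
hdfs dfs -chown -R hive:hadoop /apps/hive/warehouse/gprod1t_base.db/toll_tkt_hist_old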

Upvotes: 0
