Abhay Dandekar
Abhay Dandekar

Reputation: 1300

Hive showing non-existing partitions

I have a table in hive whose DDL looks like below:

CREATE TABLE test(id string) PARTITIONED BY (data_updated string,batch_id string,country string,state string);

data_updated=NO/batch_id=102/country='1'/state='A';
data_updated=NO/batch_id=103/country='2'/state='B';
data_updated=NO/batch_id=104/country='3'/state='C';

I want to move the data from data_updated=NO to data_updated=YES. And I do that via HDFS commands.

data_updated=YES/batch_id=102/country='1'/state='A'
data_updated=YES/batch_id=103/country='2'/state='B'
data_updated=YES/batch_id=104/country='3'/state='C'

One option is to write the multiple rename partition DDLs. But, alter partition DDLs require all partitions to be specified upto Nth level. N-level is dynamic and cannot be statically determined.

I am getting the batch_ids with the help of unix script and moving the directory manually using move command in HDFS.

Inside a loop:
hadoop fs -mv /table/directory/data_processed=NO/batch_id=$i /table/directory/data_processed=YES/

The problem with moving with HDFS commands is after repairing the table in hive, the non existing partitions still exists in hive metadata. How to remove non-existing partitions from hive metadata. Any suggestions?

Output of show partition in hive.

show partitions test;

data_updated=NO/batch_id=102/country='1'/state='A'; # NON-EXISTING
data_updated=NO/batch_id=103/country='2'/state='B'; # NON-EXISTING
data_updated=NO/batch_id=104/country='3'/state='C'; # NON-EXISTING
data_updated=YES/batch_id=102/country='1'/state='A'
data_updated=YES/batch_id=103/country='2'/state='B'
data_updated=YES/batch_id=104/country='3'/state='C'

Upvotes: 1

Views: 1870

Answers (2)

moebius
moebius

Reputation: 2269

As per this answer, msck repair table will not delete any metadata from the metastore for partitions manually deleted. For that you will have to run alter table drop partition (...) to update the metadata.

Since your partitions are dynamically generated, I would suggest you write a script to run the alter table commands, rather than running them manually. For example:

for path in $(hive -e "show partitions test" | grep "data_updated=NO")
do
  part=$(echo $path | tr "/" " ")
  echo "alter table test drop partitions($(echo $part | tr " " ","));"
done > drop_partitions.hql

Upvotes: 1

hlagos
hlagos

Reputation: 7947

My understanding is that there is no way from Hive to remove partitions based on missing hdfs directories. The easiest way that I see is drop your partitions base on conditions like this

alter table test drop if exists partition (data_updated='NO');

Upvotes: 1

Related Questions