Reputation: 5
We have an Amazon EMR cluster (v5.19.0) where we use Presto (v0.212) on top of Hive (v2.3.2) to process data. When it comes to reading and writing data, it is a monster and does everything pretty fast.
On the other hand, I am pretty frustrated with the data deletion options. There are a lot of posts on the Internet regarding data access and writing, but almost nothing beyond basic use cases regarding data deletion. Here are some methods I tried to use:
Presto's DELETE statement, which seems to fail randomly. It works fine for small tables but starts to raise random exceptions for other tables (most of them about missing files that it is itself deleting). We plan to update the EMR version soon to see if the issue goes away, but for now it isn't reliable (or we configured something wrong);
Hive's DROP PARTITION statement. This one is surprisingly slow. For bigger tables (more than 4000 partitions), it takes minutes to drop a partition that references an empty/deleted folder. I really don't understand how this command can be so slow;
Amazon S3 / HDFS rm command. This is the one we actually use; it can remove a partition's data in less than a second.
The last method seems to work fine when we use Presto queries to access the data. However, we noticed that the partitions are still present in the Hive Metastore, which makes Hive raise exceptions when trying to execute any query and inflates the number of partitions registered there. Since Hive is super slow to drop partitions, we don't know what to do to keep the Metastore clean and the process fast.
The Hive documentation has a section on the MSCK REPAIR TABLE command, which includes an option to remove missing partitions. Sadly, when I try to run it from my terminal with the DROP PARTITIONS parameter, it shows the error message "FAILED: ParseException line 1:34 missing EOF at 'drop' near 'TABLENAME'". So I believe my Hive version isn't compatible, or there is a bug in it.
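(For reference, the DROP PARTITIONS clause of MSCK REPAIR TABLE was only added in a later Hive release — Hive 3.0, if I recall correctly — which would explain the parse error on Hive 2.3.2. The intended syntax, with a hypothetical table name, is:)

```sql
-- Deregister metastore partitions whose directories no longer exist on storage.
-- Requires a Hive version that supports the ADD/DROP/SYNC clause (Hive 3.0+).
MSCK REPAIR TABLE my_table DROP PARTITIONS;
```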
So, do you know a good way to delete partitions on a real system with a configuration like mine? Please tell me how you delete data in your big data pipeline, so I can find some inspiration for my problem. Also, please let me know if you know a method for removing only the partition references from Hive, or for listing all partitions whose data was deleted. Thanks!
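(For that last point, one rough approach, assuming a placeholder table name, is to dump what the metastore has registered and compare it against the folders that actually exist on S3/HDFS:)

```sql
-- List every partition the metastore knows about for the table
-- (hypothetical table name); any entry without a matching folder
-- on S3/HDFS is a dangling partition reference.
SHOW PARTITIONS my_table;
```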
Upvotes: 0
Views: 14812
Reputation: 5
Including here more details on how I solved this issue. Note that, if possible, you should avoid this workaround and use the DELETE function of your data processing tool.

1. Use the ALTER TABLE table_name DROP PARTITION(... statement if possible;
2. Otherwise, remove the data directly with aws s3 rm or hadoop fs -rm;
3. Then deregister the partition from the Metastore. Since dropping a partition of a managed table also tries to delete its data, temporarily mark the table as external so that DROP PARTITION removes only the metadata:

ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='TRUE');
ALTER TABLE tablename DROP PARTITION(...
ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='FALSE');
Also check Piotr's answer to this question to see a nice way of deleting partitions if you are using a more recent Presto version.
Upvotes: 0
Reputation: 20730
As you observed, if you drop partition data (files & directories) on S3 or HDFS, the partitions still need to be deregistered from the Hive metastore.
The Hive way of syncing storage's state with the metastore's state is MSCK REPAIR TABLE.
The Presto way of syncing storage's state with the metastore's state is the system.sync_partition_metadata Presto Hive connector procedure.
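(A minimal invocation sketch — schema and table names here are placeholders, and the third argument selects the sync mode: ADD, DROP, or FULL. Run it against the Hive catalog, or prefix the procedure with the catalog name, e.g. hive.system.)

```sql
-- Deregister metastore partitions whose backing directories
-- no longer exist on storage (hypothetical schema/table names):
CALL system.sync_partition_metadata('my_schema', 'my_table', 'DROP');
```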
Upvotes: 2
Reputation: 38335
Try ALTER TABLE table_name RECOVER PARTITIONS; instead of the MSCK REPAIR TABLE command. It should work fine on AWS.
Upvotes: 0