Reputation: 2301
In hive, how can I delete duplicate records ? Below is my case,
First, I load data from product table to products_rcfileformat. There are 25 rows of records on product table
FROM products INSERT OVERWRITE TABLE products_rcfileformat
SELECT *;
Second, I load data from product table to products_rcfileformat. There are 25 rows of records on product table. But this time I'm NOT using OVERWRITE clause
FROM products INSERT INTO TABLE products_rcfileformat
SELECT *;
When I query the data it give me total rows = 50 which are right
Check from hdfs, it seem hdfs make another copy of file xxx_copy_1 instead of append to 000000_0
Now I want to remove those records that read from xxx_copy_1. How can I achieve this in hive command ? If I'm not mistaken, i can remove xxx_copy_1 file by using hdfs dfs -rm command follow by rerun insert overwrite command. But I want to know whether this can it be done by using hive command example like delete statement?
Upvotes: 1
Views: 2264
Reputation: 505
Check from hdfs, it seem hdfs make another copy of file xxx_copy_1 instead of append to 000000_0
The reason is hdfs is read only, not editable, as hive warehouse files (or whatever may be the location) that is still in hdfs, so it has to create a second file.
Now I want to remove those records that read from xxx_copy_1. How can I achieve this in hive command ?
Please check this post - Removing DUPLICATE rows in hive based on columns.
Let me know if you are satisfied with the answer there. I have another method, which removes duplicate entries but may not be in the way you want.
Upvotes: 0
Reputation: 4957
Partition your data such that the rows (use window function row_number) you want to drop are in a partition unto themselves. You can then drop the partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.
detail about Partition .
www.tutorialspoint.com/hive/hive_partitioning.htm
Upvotes: 1