Hive delete duplicate records

Question

In hive, how can I delete duplicate records ? Below is my case,

First, I load data from product table to products_rcfileformat. There are 25 rows of records on product table

FROM products INSERT OVERWRITE TABLE products_rcfileformat         
SELECT *;

Second, I load data from product table to products_rcfileformat. There are 25 rows of records on product table. But this time I'm NOT using OVERWRITE clause

FROM products INSERT INTO TABLE products_rcfileformat         
SELECT *;

When I query the data it give me total rows = 50 which are right

Check from hdfs, it seem hdfs make another copy of file xxx_copy_1 instead of append to 000000_0

Now I want to remove those records that read from xxx_copy_1. How can I achieve this in hive command ? If I'm not mistaken, i can remove xxx_copy_1 file by using hdfs dfs -rm command follow by rerun insert overwrite command. But I want to know whether this can it be done by using hive command example like delete statement?

novice_2 · Accepted Answer

Check from hdfs, it seem hdfs make another copy of file xxx_copy_1 instead of append to 000000_0

The reason is hdfs is read only, not editable, as hive warehouse files (or whatever may be the location) that is still in hdfs, so it has to create a second file.

Now I want to remove those records that read from xxx_copy_1. How can I achieve this in hive command ?

Please check this post - Removing DUPLICATE rows in hive based on columns.

Let me know if you are satisfied with the answer there. I have another method, which removes duplicate entries but may not be in the way you want.

Hive delete duplicate records

Answers (2)

Related Questions