Reputation: 364
There is a scenario where daily files are loaded into a particular HDFS path, and on top of that path we have created a Hive external table to expose the data in Hive. In the worst case, the files are pushed to that HDFS path twice, i.e. duplicate files land in the directory.
How do we load the second set of files without deleting them or running a separate cleanup job? What is the best practice for handling this scenario?
Kindly clarify.
Upvotes: 0
Views: 178
Reputation: 19
Duplicate files with the same filename are not possible in HDFS. If you are worried about two files with the same content, you might want to load the data as is to avoid missing anything, and then maintain a managed table that handles the duplicates (see the sketch below).
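For example, a minimal sketch of that dedup step, assuming a hypothetical external table raw_events over the landing path and a hypothetical managed table clean_events, both with columns id, payload and load_ts; run it from the shell with hive -e:
hive -e "
-- table and column names are placeholders; adjust to your schema
INSERT OVERWRITE TABLE clean_events
SELECT id, payload, load_ts
FROM (
  SELECT id, payload, load_ts,
         row_number() OVER (PARTITION BY id ORDER BY load_ts DESC) AS rn
  FROM raw_events
) t
WHERE rn = 1;
"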
Use case: get only the latest file
Detect the latest file in the HDFS directory:
hdfs dfs -ls -R /your/hdfs/dir/ | awk -F" " '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3
Then copy it to another HDFS directory. That directory should be emptied first, because we want the latest file only.
# remove the old target directory (it only holds the previous latest file)
hdfs dfs -rm -r /your/hdfs/latest_dir/
# recreate it empty so the copy below lands inside the directory
hdfs dfs -mkdir -p /your/hdfs/latest_dir/
# copy the latest file
hdfs dfs -cp $(hdfs dfs -ls -R /your/hdfs/dir/ | awk -F" " '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3) /your/hdfs/latest_dir/
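Putting it together, a minimal sketch of a script you could schedule, assuming the same placeholder paths (/your/hdfs/dir/ as the landing zone, /your/hdfs/latest_dir/ as the directory the external table points at):
#!/usr/bin/env bash
set -euo pipefail

SRC_DIR=/your/hdfs/dir/            # landing directory (may receive duplicates)
LATEST_DIR=/your/hdfs/latest_dir/  # directory the Hive external table reads

# pick the most recently modified file by sorting on the date and time columns of ls
LATEST_FILE=$(hdfs dfs -ls -R "$SRC_DIR" | awk '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3)

# rebuild the target directory so it contains only that file
hdfs dfs -rm -r -f "$LATEST_DIR"
hdfs dfs -mkdir -p "$LATEST_DIR"
hdfs dfs -cp "$LATEST_FILE" "$LATEST_DIR"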
Upvotes: 0