Reputation: 364
There is a scenario where daily files are loaded into a particular HDFS path, and on top of that path we have created a Hive external table to expose the data in Hive. In the worst case, the files are pushed to that HDFS path twice, i.e. duplicate files land in the directory.
How do we load the second set of files without deleting them or running a separate cleanup job? What is the best practice for handling this scenario?
Kindly clarify.
Upvotes: 0
Views: 178
Reputation: 19
Duplicate files with the same filename are not possible in HDFS. If you are worried about two files with the same content, you might want to load the data as is to avoid missing anything, and then maintain a managed table that handles the duplicates (see the sketch below).
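For example, a minimal sketch of that dedup step, assuming a hypothetical external table raw_events over the landing path and a hypothetical managed table clean_events, both with columns id, payload and load_ts; run it from the shell with hive -e:
hive -e "
-- table and column names are placeholders; adjust to your schema
INSERT OVERWRITE TABLE clean_events
SELECT id, payload, load_ts
FROM (
  SELECT id, payload, load_ts,
         row_number() OVER (PARTITION BY id ORDER BY load_ts DESC) AS rn
  FROM raw_events
) t
WHERE rn = 1;
"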
Use case: get only the latest file
Detect the latest file in the HDFS directory:
hdfs dfs -ls -R /your/hdfs/dir/ | awk -F" " '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3
Then copy it to another HDFS directory. That directory should be emptied first, because we want the latest file only.
# remove the old target directory (it only holds the previous latest file)
hdfs dfs -rm -r /your/hdfs/latest_dir/
# recreate it empty so the copy below lands inside the directory
hdfs dfs -mkdir -p /your/hdfs/latest_dir/
# copy the latest file
hdfs dfs -cp $(hdfs dfs -ls -R /your/hdfs/dir/ | awk -F" " '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3) /your/hdfs/latest_dir/
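Putting it together, a minimal sketch of a script you could schedule, assuming the same placeholder paths (/your/hdfs/dir/ as the landing zone, /your/hdfs/latest_dir/ as the directory the external table points at):
#!/usr/bin/env bash
set -euo pipefail

SRC_DIR=/your/hdfs/dir/            # landing directory (may receive duplicates)
LATEST_DIR=/your/hdfs/latest_dir/  # directory the Hive external table reads

# pick the most recently modified file by sorting on the date and time columns of ls
LATEST_FILE=$(hdfs dfs -ls -R "$SRC_DIR" | awk '{print $6" "$7" "$8}' | sort -nr | head -1 | cut -d" " -f3)

# rebuild the target directory so it contains only that file
hdfs dfs -rm -r -f "$LATEST_DIR"
hdfs dfs -mkdir -p "$LATEST_DIR"
hdfs dfs -cp "$LATEST_FILE" "$LATEST_DIR"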
Upvotes: 0