Reputation: 785
I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but I only want to remove my most recent files.
For a single day I may have three files like the ones below, and I want to remove only the new file. I can tell it's new from the modification timestamp shown by hadoop fs -ls:
/this/is/my_directory/event_date1_newfile_20191114
/this/is/my_directory/event_date1_oldfile_20190801
/this/is/my_directory/event_date1_oldfile_20190801
I have many dates, so I'll have to repeat this for event_date2, event_date3, etc., always removing the 'newfile_20191114' from each date.
The older dates are from August 2019, and my newfiles were updated yesterday, on 11/14/19.
I feel like there should be an easy/quick solution to this, but I'm having trouble finding the reverse case from what most folks have asked about.
Upvotes: 0
Views: 784
Reputation: 1642
As mentioned in your answer, you already have the list of files that need to be deleted. Create a simple script and redirect the output to a temp file, like this:
hdfs dfs -ls /tmp | sort -k6,7 > files.txt
Note that sort -k6,7 sorts the listing by modification timestamp (columns 6 and 7 of the ls output), oldest first. I am sure you don't want to delete everything, so select only the n newest files, say 100; since the newest files sort to the bottom, take the last 100 lines with tail:
hdfs dfs -ls /tmp | sort -k6,7 | tail -n 100 | awk '{print $8}' > files.txt
Or, if you know the specific timestamp of your new files, you can grep for it:
hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt
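As a sanity check, the sort/grep/awk step can be tried locally on fabricated ls-style lines before touching HDFS (the paths, sizes, and owners below are made up for illustration; in typical hdfs dfs -ls output, fields 6 and 7 are the modification date and time and field 8 is the path):

```shell
#!/bin/sh
# Fabricated sample of `hdfs dfs -ls` output (illustration only; the real
# input would come from `hdfs dfs -ls /tmp`).
sample='-rw-r--r--   3 me grp  10 2019-08-01 09:00 /this/is/my_directory/event_date1_oldfile_20190801
-rw-r--r--   3 me grp  10 2019-11-14 10:30 /this/is/my_directory/event_date1_newfile_20191114
-rw-r--r--   3 me grp  10 2019-11-14 10:31 /this/is/my_directory/event_date2_newfile_20191114'

# Keep only lines whose modification date is 2019-11-14 and print the path column.
matches=$(echo "$sample" | sort -k6,7 | grep "2019-11-14" | awk '{print $8}')
echo "$matches"
```

Only the two newfile paths survive, because "2019-11-14" matches the date column and not the old files' names.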
Then read that file and delete the files one by one:
while read -r file; do
    hdfs dfs -rm "$file"
    echo "Deleted $file" >> deleted_files.txt  # track which files have been deleted
done < files.txt
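Since the loop is destructive, it may be worth a dry run first: swap the rm command for an echo and review what would be removed. A sketch (files.txt is fabricated here for illustration; in practice it comes from the pipeline above):

```shell
#!/bin/sh
# Fabricated files.txt for illustration; normally produced by the
# ls | sort | grep | awk pipeline.
printf '%s\n' \
  /this/is/my_directory/event_date1_newfile_20191114 \
  /this/is/my_directory/event_date2_newfile_20191114 > files.txt

# Dry run: print what would be deleted instead of deleting it.
while read -r file; do
  echo "Would delete: $file"   # replace with: hdfs dfs -rm "$file"
done < files.txt | tee dry_run.txt
```

Once dry_run.txt looks right, put the real hdfs dfs -rm back in.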
So your complete script can look like this:
#!/bin/bash
hdfs dfs -ls /tmp | sort -k6,7 | grep "<exact_time_stamp>" | awk '{print $8}' > files.txt
while read -r file; do
    hdfs dfs -rm "$file"
    echo "Deleted $file" >> deleted_files.txt  # track which files have been deleted
done < files.txt
Upvotes: 2