How to delete the most recently created files in multiple HDFS directories?

Question

I made a mistake and have added a few hundred part files to a table partitioned by date. I am able to see which files are new (these are the ones I want to remove). Most cases I've seen on here relate to deleting files older than a certain date, but I only want to remove my most recent files.

For a single day, I may have 3 files like the ones below, and I only want to remove the newfile. I can tell it's new because of the update timestamp shown by hadoop fs -ls.

/this/is/my_directory/event_date1_newfile_20191114
/this/is/my_directory/event_date1_oldfile_20190801
/this/is/my_directory/event_date1_oldfile_20190801

I have many dates, so I'll have to do this for event_date2, event_date3, and so on, always removing the 'newfile_20191114' from each date.

The older dates are from August 2019, and my newfiles were updated yesterday, on 11/14/19.

I feel like there should be an easy/quick solution to this, but I'm having trouble finding the reverse case from what most folks have asked about.

Strick · Accepted Answer

As mentioned in your question, you already know which files need to be deleted. Start with a simple script that redirects the sorted listing to a temporary file, like this:

hdfs dfs -ls /tmp | sort -k6,7 > files.txt

Note that sort -k6,7 lists all the files, sorted by modification timestamp with the oldest first. You presumably don't want to delete all of them, so you can select only the n most recent files, say 100. Because the newest entries are at the end of the sorted output, take them with tail rather than head:

hdfs dfs -ls /tmp | sort -k6,7 | tail -100 | awk '{print $8}' > files.txt

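Or, if you know the specific timestamp of your new files, you can filter on it instead. An empty grep pattern would match every line, so put the date in the pattern, for example 2019-11-14 for the files from your question:

hdfs dfs -ls /tmp | sort -k6,7 | grep "2019-11-14" | awk '{print $8}' > files.txt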
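One caveat with a grep-based filter is that it matches its pattern anywhere in the line, including inside file names. A minimal sketch of a stricter filter follows. It compares only the date column (field 6 of the hdfs dfs -ls output), skips the "Found N items" header, and ignores directories. The function name paths_modified_on is hypothetical, and the sketch assumes paths contain no whitespace.

```shell
#!/bin/bash
# Hypothetical helper: reads `hdfs dfs -ls` output on stdin and prints the
# path (field 8) of every file whose modification date (field 6) equals $1.
# Assumption: paths contain no spaces, so the path is exactly field 8.
paths_modified_on() {
  # NF >= 8 skips the "Found N items" header; $1 !~ /^d/ skips directories.
  awk -v d="$1" 'NF >= 8 && $1 !~ /^d/ && $6 == d { print $8 }'
}

# Example invocation (needs a Hadoop client, so it is left commented out):
# hdfs dfs -ls /tmp | paths_modified_on 2019-11-14 > files.txt
```

The date is passed in the yyyy-MM-dd form that hdfs dfs -ls prints, so 11/14/19 becomes 2019-11-14.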

Then read that file and delete the files one by one:

while read -r file; do
  hdfs dfs -rm "$file"
  echo "Deleted $file" >> deleted_files.txt # track which files have been deleted
done < files.txt



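So your complete script can look like this:

#!/bin/bash

# Save the paths of entries last modified on the given date (2019-11-14 in the question)
hdfs dfs -ls /tmp | sort -k6,7 | grep "2019-11-14" | awk '{print $8}' > files.txt

# Delete each listed file and log it
while read -r file; do
  hdfs dfs -rm "$file"
  echo "Deleted $file" >> deleted_files.txt # track which files have been deleted
done < files.txt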
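Since a mistaken delete across a few hundred part files is hard to undo, it may be worth previewing the commands first. The sketch below is one possible way to do that, not part of the original answer. The function name delete_listed and the DRY_RUN variable are hypothetical. It reads one path per line and, by default, only prints what it would run.

```shell
#!/bin/bash
# Hypothetical helper: reads one HDFS path per line on stdin.
# With DRY_RUN=1 (the default in this sketch) it only prints the command;
# with DRY_RUN=0 it runs `hdfs dfs -rm` and logs each successful delete
# to the file named by $1 (deleted_files.txt if omitted).
delete_listed() {
  local log="${1:-deleted_files.txt}"
  while IFS= read -r file; do
    [ -n "$file" ] || continue              # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: hdfs dfs -rm $file"
    elif hdfs dfs -rm "$file" < /dev/null; then   # keep hdfs from consuming the path list
      echo "Deleted $file" >> "$log"
    else
      echo "Failed to delete $file" >&2
    fi
  done
}
```

A typical use would be delete_listed < files.txt to review the output, then DRY_RUN=0 delete_listed < files.txt to actually delete. To cover many date partitions at once, files.txt could be built from a recursive listing such as hdfs dfs -ls -R /this/is/my_directory instead of a single directory.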
