Reputation: 926
I want to write a data-retention shell script which, given two inputs (a base directory and a retention period in days), deletes ONLY FILES (not directories) older than the retention period. I have searched on the Internet and there are some solutions, but they list directories and delete them based on their modification time.
But a directory may have a very old timestamp but may contain recently updated files.
How do I proceed? The mindepth and maxdepth options of the find command do not work on HDFS.
The base directory may have multiple sub-directories which may have sub-directories and so on.
The base directory is /user/abhikaushik. Then we have sub-folders in the form yyyy/mm/dd/hh, like base/2017/04/23/22, or paths like base/studies/programming/file1.txt, and so on.
Upvotes: 2
Views: 10457
Reputation: 3914
A small improvement on Rahul Sharma's answer:
hdfs dfs -rm -r `hadoop fs -ls -R <location> | grep -v '.*2023-.*' | awk '{print $8}' | tac`
This deletes the files that were not created in 2023 (the year filter is not part of the improvement, but it might help someone).
At the end of the command we invert the list with tac, because ls lists a folder before its sub-folders and files:
TOTO/
TOTO/TITI/
TOTO/TITI/TUTU.parquet
TOTO/TITI/TATA.parquet
This way the files and folders are deleted in the right order, so you do not get "file not found" errors:
TOTO/TITI/TATA.parquet
TOTO/TITI/TUTU.parquet
TOTO/TITI/
TOTO/
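The effect of tac can be checked locally on a fabricated listing (the TOTO/TITI paths are just placeholders, not real HDFS output):

```shell
# Reverse a depth-first listing so children come before their parents.
printf '%s\n' 'TOTO/' 'TOTO/TITI/' 'TOTO/TITI/TUTU.parquet' 'TOTO/TITI/TATA.parquet' | tac
```

The reversed output starts with TOTO/TITI/TATA.parquet and ends with TOTO/, so each file is removed before the folder that contains it.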
Upvotes: 0
Reputation: 5834
Try this:
Delete all the files created from 2017-01 to 2017-08:
hadoop fs -rm -r `hadoop fs -ls -R <location> | grep '.*2017-0[1-8].*' | awk '{print $8}'`
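The month filter can be sanity-checked locally on made-up timestamps before pointing it at HDFS. Note that 2017-0[1-8] pins the match to months 01 through 08, whereas the looser 2017-[0-8] matches every 2017 date, since each month's first digit (0 or 1) falls inside [0-8]:

```shell
# Made-up timestamps; only the Jan-Aug 2017 line should survive the filter.
printf '%s\n' '2017-04-23 22:10' '2017-09-30 01:00' '2018-01-15 12:00' | grep '2017-0[1-8]'
```

This prints only 2017-04-23 22:10.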
Upvotes: 2
Reputation: 3421
How about this:
hdfs dfs -ls -R /MY/BASE/DIR/PATH | grep "^-" | tr -s " " | cut -d' ' -f6-8 | awk 'BEGIN{ RETENTION_DAYS=10; LAST=24*60*60*RETENTION_DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print $3 }}'
where,
List all the files recursively:
hdfs dfs -ls -R /MY/BASE/DIR/PATH
Get only FILES from the list:
grep "^-"
Squeeze runs of spaces into a single space:
tr -s " "
Get the required columns:
cut -d' ' -f6-8
Process the remaining columns using awk:
awk
Initialize the retention window (in seconds) and fetch the current epoch time:
RETENTION_DAYS=10;
LAST=24*60*60*RETENTION_DAYS;
"date +%s" | getline NOW
Create a command to get the epoch value for timestamp of the file on HDFS:
cmd="date -d'\''"$1" "$2"'\'' +%s";
Execute the command to get epoch value for HDFS file:
cmd | getline WHEN;
Get the time difference:
DIFF=NOW-WHEN;
Print the file path if the difference exceeds the retention window:
if(DIFF > LAST){ print $3 }
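The awk date arithmetic above can be dry-run locally on fabricated ls-style lines, with no HDFS involved (the paths are made up, and the -d flag assumes GNU date):

```shell
# Feed two fake "date time path" lines through the same awk logic:
# one stamped right now, one from 2015. Only the old one should print.
recent=$(date '+%Y-%m-%d %H:%M')
printf '%s\n' "$recent /base/fresh.txt" '2015-01-01 00:00 /base/stale.txt' |
awk 'BEGIN{ RETENTION_DAYS=10; LAST=24*60*60*RETENTION_DAYS; "date +%s" | getline NOW }
     { cmd="date -d\""$1" "$2"\" +%s"; cmd | getline WHEN; close(cmd);
       if (NOW-WHEN > LAST) print $3 }'
```

This prints only /base/stale.txt, confirming the comparison keeps recent files and selects old ones.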
Proceed once you are sure that the above command lists the files you want to delete.
Now, instead of doing a print operation in the last step, you can do what you actually want, i.e. delete the older FILES, like this:
hdfs dfs -ls -R /MY/BASE/DIR/PATH | grep "^-" | tr -s " " | cut -d' ' -f6-8 | awk 'BEGIN{ RETENTION_DAYS=10; LAST=24*60*60*RETENTION_DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ system("hdfs dfs -rm -r -skipTrash "$3 ) }}'
You just need to change the values of /MY/BASE/DIR/PATH and RETENTION_DAYS depending upon your requirement (here it's 10 days).
Hope this helps!
Upvotes: 13