rkabhishek

Reputation: 926

Hadoop - Delete only files older than X days

I want to write a data retention shell script which, given two inputs, a base directory and a retention period (in days), deletes ONLY FILES (not directories) older than the retention period. I have searched on the Internet and there are some solutions, but they list the directories and delete them based on the modification time.

But a directory may have a very old timestamp but may contain recently updated files.

How do I proceed? The mindepth and maxdepth options of the find command do not work on HDFS.

The base directory may have multiple sub-directories which may have sub-directories and so on.

The base directory is /user/abhikaushik.

Then we have sub-folders in the form yyyy/mm/dd/hh, like base/2017/04/23/22, or paths like base/studies/programming/file1.txt, and so on.

Upvotes: 2

Views: 10457

Answers (3)

WannaGetHigh

Reputation: 3914

A small improvement on Rahul Sharma's answer:

hdfs dfs -rm -r `hadoop fs -ls -R <location> | grep -v '.*2023-.*' | awk '{print $8}' | tac`

This will delete the files that were not created in 2023 (the year filter is not part of the improvement, but it may help someone).

At the end of the command we invert the list with tac, because ls lists a folder before its sub-folders and files:

TOTO/
TOTO/TITI/
TOTO/TITI/TUTU.parquet
TOTO/TITI/TATA.parquet

This way the files and folders are deleted in the right order, and you do not get "file not found" errors:

TOTO/TITI/TATA.parquet
TOTO/TITI/TUTU.parquet
TOTO/TITI/
TOTO/
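The effect of tac can be checked on a plain list (using the illustrative names from above):

```shell
# tac (GNU coreutils) reverses the line order, so files come before
# their parent folders and the delete never hits an already-removed path
printf 'TOTO/\nTOTO/TITI/\nTOTO/TITI/TUTU.parquet\nTOTO/TITI/TATA.parquet\n' | tac
# prints TATA.parquet, TUTU.parquet, TOTO/TITI/, TOTO/ (one per line)
```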

Upvotes: 0

Rahul Sharma

Reputation: 5834

Try this-

Delete all the files created in months 2017-01 through 2017-08:

hadoop fs -rm -r `hadoop fs -ls -R <location> | grep '.*2017-0[1-8].*' | awk '{print $8}'`
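Since the command deletes immediately, it may be safer to preview the grep/awk stage on its own first. A sketch against a simulated listing (the paths below are hypothetical); the tightened pattern 2017-0[1-8] matches only months 01 through 08:

```shell
# Simulated `hadoop fs -ls -R` output; field 8 is the path
cat > /tmp/listing.txt <<'EOF'
-rw-r--r--   3 abhi hdfs  512 2017-03-02 09:10 /base/2017/03/a.txt
-rw-r--r--   3 abhi hdfs  512 2017-09-02 09:10 /base/2017/09/b.txt
EOF

# Preview which paths the real -rm would receive
grep '.*2017-0[1-8].*' /tmp/listing.txt | awk '{print $8}'
# prints only /base/2017/03/a.txt (September does not match 0[1-8])
```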

Upvotes: 2

PradeepKumbhar

Reputation: 3421

How about this:

hdfs dfs -ls -R /MY/BASE/DIR/PATH | grep "^-" | tr -s " " | cut -d' ' -f6-8 | awk 'BEGIN{ RETENTION_DAYS=10; LAST=24*60*60*RETENTION_DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ print $3 }}'

where,

List all the files recursively:

hdfs dfs -ls -R /MY/BASE/DIR/PATH

Get only FILES from the list:

grep "^-"

Replace extra spaces:

tr -s " "

Get the required columns:

cut -d' ' -f6-8

Processing using awk:

awk

Initialize the DIFF duration and current time:

RETENTION_DAYS=10;

LAST=24*60*60*RETENTION_DAYS;

"date +%s" | getline NOW

Create a command to get the epoch value for timestamp of the file on HDFS:

cmd="date -d'\''"$1" "$2"'\'' +%s";

Execute the command to get epoch value for HDFS file:

cmd | getline WHEN;

Get the time difference:

DIFF=NOW-WHEN;

Print the output depending upon the difference:

if(DIFF > LAST){ print $3 }
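The date arithmetic from the awk BEGIN block and the per-line comparison can be sanity-checked in plain shell first (GNU date; the sample timestamp is made up):

```shell
RETENTION_DAYS=10
LAST=$((24 * 60 * 60 * RETENTION_DAYS))   # retention period in seconds
NOW=$(date +%s)                           # current time as epoch seconds
WHEN=$(date -d '2017-04-23 22:00' +%s)    # epoch value of a sample HDFS timestamp
DIFF=$((NOW - WHEN))
if [ "$DIFF" -gt "$LAST" ]; then
  echo "older than retention period"
fi
```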

--------------------------------------------------------------------------------


Proceed once you are sure that the above command lists the files you want to delete.

Now, instead of the print operation in the last step, you can do what you actually want, i.e. delete the older FILES, like this:

hdfs dfs -ls -R /MY/BASE/DIR/PATH | grep "^-" | tr -s " " | cut -d' ' -f6-8 | awk 'BEGIN{ RETENTION_DAYS=10; LAST=24*60*60*RETENTION_DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > LAST){ system("hdfs dfs -rm -r -skipTrash "$3 ) }}'

You just need to change the values of /MY/BASE/DIR/PATH and RETENTION_DAYS depending upon your requirement (here it is 10 days).
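For the original two-input requirement, the same pipeline can be wrapped in a small script taking the base directory and retention period as arguments. This is an untested sketch: the script and function names are made up, and it assumes GNU date and a working hdfs client. Note that plain -rm (without -r) refuses to remove directories, which matches the files-only requirement:

```shell
#!/bin/sh
# retention.sh (hypothetical name) - usage: retention.sh <base_dir> <days>

# Reads `hdfs dfs -ls -R` output on stdin and prints the paths of plain
# files (lines starting with "-") older than $1 days.
list_old_files() {
  grep "^-" | tr -s " " | cut -d' ' -f6-8 |
  awk -v days="$1" '
    BEGIN { "date +%s" | getline NOW; LAST = days * 24 * 60 * 60 }
    { cmd = "date -d\"" $1 " " $2 "\" +%s"
      cmd | getline WHEN
      close(cmd)
      if (NOW - WHEN > LAST) print $3 }'
}

if [ -n "$1" ] && [ -n "$2" ]; then
  hdfs dfs -ls -R "$1" | list_old_files "$2" |
  while read -r f; do
    # plain -rm (no -r) deletes files only and fails on directories
    hdfs dfs -rm -skipTrash "$f"
  done
fi
```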

Hope this helps!

Upvotes: 13
