Reputation: 19
I have a list of files in HDFS that has to be filtered for the latest n-hour period through bash scripting.
$ find . -name "*" -type f \
-newermt "2019-09-22 23:59:59" \
! -newermt "2019-09-23 23:59:59" \
-exec ls -lt --time-style=long-iso {} +
I tried this, but I am still stuck at date-level filtering and cannot proceed further.
The expected result is to filter files in HDFS for any n-hour period of the day.
Upvotes: 1
Views: 2575
Reputation: 26471
This is an adaptation of this answer:
Note: I was unable to test this, but you can verify it step by step by inspecting the output of each stage.
Normally I would say never parse the output of ls, but with Hadoop you don't have a choice here, as there is no real equivalent to find. (Since 2.7.0 there is a find, but according to the documentation it is very limited.)
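For reference, the built-in find only supports name matching (-name/-iname) and -print, so it cannot filter on modification time:
$ hadoop fs -find /path/to/folder/ -name "*.log" -print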
Step 1: recursive ls
$ hadoop fs -ls -R /path/to/folder/
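Each line of the output should look roughly like this (fields: permissions, replication, owner, group, size, date, time, path; the example values here are made up):
-rw-r--r--   3 hdfs supergroup       1234 2019-09-21 22:15 /path/to/folder/file.txt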
Step 2: use awk to pick files only. Directories are recognized by their permission string, which starts with d, so we have to exclude those.
$ hadoop fs -ls -R /path/to/folder/ | awk '!/^d/'
Make sure you do not end up with funny lines here that are empty or contain just a directory name.
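As a quick sanity check: a regular hadoop fs -ls line has eight fields (assuming your paths contain no spaces), so the following should print nothing:
$ hadoop fs -ls -R /path/to/folder/ | awk '!/^d/ && NF != 8'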
Step 3: use awk to process the time interval and select the files that fall inside it. I am assuming you have a standard awk, so I will not use GNU extensions. Hadoop outputs timestamps in the format yyyy-MM-dd HH:mm. This format sorts lexicographically and is located in fields 6 and 7. The example below selects all files modified between "2019-09-21 22:00" and "2019-09-21 23:00":
$ hadoop fs -ls -R /path/to/folder/ \
| awk -v tStart="2019-09-21 22:00" -v tEnd="2019-09-21 23:00" \
'(!/^d/) && (($6" "$7) >= tStart) && (($6" "$7) <= tEnd)'
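To get the "latest n hr" behaviour from the question, you can compute the interval boundaries with date instead of hard-coding them. A minimal sketch, assuming GNU date and the same made-up path:
$ n=3                                                 # look back 3 hours
$ tStart=$(date -d "$n hours ago" '+%Y-%m-%d %H:%M')  # lower bound: n hours ago
$ tEnd=$(date '+%Y-%m-%d %H:%M')                      # upper bound: now
$ hadoop fs -ls -R /path/to/folder/ \
| awk -v tStart="$tStart" -v tEnd="$tEnd" \
'(!/^d/) && (($6" "$7) >= tStart) && (($6" "$7) <= tEnd)'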
Upvotes: 4