Manivannan Dharman

Reputation: 19

Filtering Out Files in HDFS With Time Range

I have a list of files in HDFS that has to be filtered down to the latest n-hour period using a bash script.

$ find . -name "*" -type f                  \
    -newermt "2019-09-22 23:59:59"          \
    ! -newermt "2019-09-23 23:59:59"        \
    -exec ls -lt --time-style=long-iso {} +

I tried this, but I am still stuck at date-level filtering and cannot proceed further.

The expected result is to select files in HDFS for any n-hour period of the day.

Upvotes: 1

Views: 2575

Answers (1)

kvantour

Reputation: 26471

This is an adaptation of this answer:

Note: I was unable to test this, but you can check it step by step by looking at the output of each step.

Normally I would say Never parse the output of ls, but with Hadoop, you don't have a choice here as there is no equivalent to find. (Since 2.7.0 there is a find, but it is very limited according to the documentation)

Step 1: recursive ls

$ hadoop fs -ls -R /path/to/folder/

Step 2: use awk to pick files only. Directories are recognized by their permissions, which start with d, so we have to exclude those.

$ hadoop fs -ls -R /path/to/folder/ | awk '!/^d/'

Make sure you do not end up with funny lines here, such as empty lines or lines holding just a directory name.
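One way to guard against such stray lines is to also require a minimum number of fields. This is a minimal sketch, assuming the usual 8-column listing (permissions, replication, owner, group, size, date, time, path); the function name `files_only` is made up for illustration:

```shell
# Keep only regular-file entries from a `hadoop fs -ls -R` listing read on stdin.
# Assumption: every real entry has at least 8 whitespace-separated fields,
# so empty lines and short header lines are dropped by the NF check.
files_only() {
    awk '!/^d/ && NF >= 8'
}

# Usage (the path is a placeholder):
#   hadoop fs -ls -R /path/to/folder/ | files_only
```

Paths that contain spaces yield more than 8 fields, so they still pass the check.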

Step 3: use awk to process the time interval and exclude the directories. I am assuming you have any standard awk, so I will not use GNU extensions. Hadoop outputs the time as yyyy-MM-dd HH:mm. This format sorts correctly as a plain string and is located in fields 6 and 7. The example below selects all files with a timestamp between "2019-09-21 22:00" and "2019-09-21 23:00":

$ hadoop fs -ls -R /path/to/folder/  \
   | awk -v tStart="2019-09-21 22:00" -v tEnd="2019-09-21 23:00" \
         '(!/^d/) && (($6" "$7) >= tStart) && (($6" "$7) <= tEnd)'
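
For the "latest n hours" case from the question, the two bounds could be computed with date instead of being typed in. This is an untested-against-Hadoop sketch with two assumptions: GNU date is available (for the -d option), and the listing timestamps are in the same time zone as the local clock. The function names `filter_window` and `filter_last_hours` are made up for illustration:

```shell
# Print file entries from a `hadoop fs -ls -R` listing (stdin) whose timestamp
# lies in [tStart, tEnd]; both arguments use the format "yyyy-MM-dd HH:mm".
filter_window() {
    awk -v tStart="$1" -v tEnd="$2" \
        '!/^d/ && NF >= 8 && (($6" "$7) >= tStart) && (($6" "$7) <= tEnd)'
}

# Same filter for the last N hours, counted back from now.
# Assumption: GNU date, which understands -d "N hours ago".
filter_last_hours() {
    filter_window "$(date -d "$1 hours ago" '+%Y-%m-%d %H:%M')" \
                  "$(date '+%Y-%m-%d %H:%M')"
}

# Usage (the path is a placeholder):
#   hadoop fs -ls -R /path/to/folder/ | filter_last_hours 3
```

The comparison is a plain string comparison, which works only because the timestamp format is zero-padded and ordered from year down to minute.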

Upvotes: 4
