Reputation: 311
Can hadoop fs -ls be used to find all directories older than N days (from the current date)?
I am trying to write a clean up routine to find and delete all directories on HDFS (matching a pattern) which were created N days prior to the current date.
Upvotes: 10
Views: 21304
Reputation: 45321
I didn't have the HdfsFindTool, nor the fsimage from curl, and I didn't much like piping ls to grep into a while loop using date, awk, hadoop, and awk again. But I appreciated the answers.
I felt like it could be done with just one ls, one awk, and maybe an xargs.
I also added options to list the matching files or summarize their total size before choosing to delete them, as well as to pick a specific directory to search. Lastly, I leave directories alone and only concern myself with the files.
#!/bin/bash
USAGE="Usage: $0 [N days] (list|size|delete) [path, default /tmp/hive]"

if [ ! "$1" ]; then
    echo "$USAGE"
    exit 1
fi

# Cutoff timestamp in the same "YYYY-MM-DD HH:MM" format that
# 'hdfs dfs -ls' prints in columns 6 and 7, so a plain string
# comparison in awk works.
AGO=$(date --date "$1 days ago" "+%F %R")
echo "# Will search for files older than $AGO"

if [ ! "$2" ]; then
    echo "$USAGE"
    exit 1
fi

INPATH="${3:-/tmp/hive}"
echo "# Will search under $INPATH"

# $1 of the listing is the permissions column; /^[^d]/ skips directories.
case $2 in
    list)
        hdfs dfs -ls -R "$INPATH" |\
            awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'"'
        ;;
    size)
        hdfs dfs -ls -R "$INPATH" |\
            awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'" {
                sum += $5; cnt += 1 } END {
                print cnt, "Files with total", sum, "Bytes" }'
        ;;
    delete)
        hdfs dfs -ls -R "$INPATH" |\
            awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'" { print $8 }' |\
            xargs hdfs dfs -rm -skipTrash
        ;;
    *)
        echo "$USAGE"
        exit 1
        ;;
esac
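For example (the file name hdfs_cleanup.sh is my own, and 30 days is illustrative), a sensible workflow is to size or list first and delete only once the output looks right:
./hdfs_cleanup.sh 30 size                      # count and total bytes under /tmp/hive
./hdfs_cleanup.sh 30 list /user/joe/staging    # inspect matches under a custom path
./hdfs_cleanup.sh 30 delete                    # remove them for good (-skipTrash bypasses the trash)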
I hope others find this useful.
Upvotes: 2
Reputation: 10650
This script lists all the directories that are older than [days]:
#!/bin/bash
usage="Usage: $0 [days]"

if [ ! "$1" ]; then
    echo "$usage"
    exit 1
fi

now=$(date +%s)

# -lsr is deprecated on newer releases; 'hadoop fs -ls -R' is the modern
# equivalent. grep "^d" keeps only directories (the permission string
# starts with d).
hadoop fs -lsr | grep "^d" | while read f; do
    # Column 6 of the listing is the modification date (YYYY-MM-DD).
    dir_date=$(echo "$f" | awk '{print $6}')
    # Age of the directory in whole days.
    difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
    if [ "$difference" -gt "$1" ]; then
        echo "$f"
    fi
done
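Since the question ultimately wants to delete the matches, here is a hedged deletion variant of the same loop, assuming the path is column 8 of the listing and a Hadoop 2+ shell (older releases spell it -rmr). The echo guard is deliberate; drop it only after checking the output:
now=$(date +%s)
hadoop fs -lsr | grep "^d" | while read f; do
    dir_date=$(echo "$f" | awk '{print $6}')
    difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
    if [ "$difference" -gt "$1" ]; then
        # echo first; remove the echo to actually delete (irreversible with -skipTrash)
        echo hadoop fs -rm -r -skipTrash "$(echo "$f" | awk '{print $8}')"
    fi
done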
Upvotes: 18
Reputation: 14494
If you happen to be using the CDH distribution of Hadoop, it comes with a very useful HdfsFindTool command, which behaves like Linux's find command.
If you're using the default parcel location, here's how you'd do it:
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
org.apache.solr.hadoop.HdfsFindTool -find PATH -mtime +N
Where you'd replace PATH with the search path and N with the number of days.
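For instance, to find everything under /tmp/hive older than 30 days (both values are illustrative):
hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool -find /tmp/hive -mtime +30
The tool prints matching paths, so its output can be piped into hdfs dfs -rm -r the same way as the ls-based approaches above.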
Upvotes: 7
Reputation: 665
For real clusters it is not a good idea to use ls. If you have admin rights, it is more suitable to use the fsimage.
I modified the script above to illustrate the idea.
First, fetch the fsimage:
curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
Then convert it to text (same output as lsr gives):
hdfs oiv -i img.dump -o fsimage.txt
Script:
#!/bin/bash
usage="Usage: dir_diff.sh [days]"

if [ ! "$1" ]; then
    echo "$usage"
    exit 1
fi

now=$(date +%s)

# Fetch the latest fsimage from the NameNode and convert it to the
# lsr-style text listing that the loop below parses.
curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
hdfs oiv -i img.dump -o fsimage.txt

# Keep only directories (the permission string starts with d) and print
# those older than the given number of days.
grep "^d" fsimage.txt | while read f; do
    dir_date=$(echo "$f" | awk '{print $6}')
    difference=$(( ( now - $(date -d "$dir_date" +%s) ) / (24 * 60 * 60) ))
    if [ "$difference" -gt "$1" ]; then
        echo "$f"
    fi
done
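Since the question also asks for directories matching a pattern, a hedged tweak: pre-filter the dump on the path, which is the last field of each line, and feed the result into the same while loop as above. The regex /data/logs/.* is purely illustrative:
pattern='/data/logs/.*'    # illustrative, adjust to your layout
grep "^d" fsimage.txt | awk -v pat="$pattern" '$NF ~ pat'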
Upvotes: 4