Reputation: 437
I want to find the file with the maximum size in an HDFS directory. Does anyone have an idea how to do this? I'm on Hadoop 2.6.0.
I found
hadoop fs -ls -S /url
which can sort output by file size, in the Hadoop 2.7.0 documentation, but it's not supported in 2.6.0. So is there a similar way to sort output files by size? Thank you!
Upvotes: 0
Views: 5297
Reputation: 544
Please try the below command (the size is the first column of the du output, so a reverse numeric sort puts the largest entry first):
hadoop fs -du <folder> | sort -n -r | head -n 1
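As a sketch of why this works: hadoop fs -du prints a size followed by a path per entry, so the pipeline can be tested against simulated output (the /data paths and sizes below are made up):

```shell
# Simulated `hadoop fs -du /data` output: <size-in-bytes> <path>.
# Replace the printf with the real command; the pipeline stays the same.
printf '1073741824 /data/a\n5368709120 /data/b\n104857600 /data/c\n' \
  | sort -n -r | head -n 1
# prints: 5368709120 /data/b
```

The reverse numeric sort (-n -r) orders entries by byte count descending, and head -n 1 keeps only the largest one.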
Upvotes: 0
Reputation: 3374
Try this to find which is the max:
hdfs dfs -ls /path | sort -r -n -k 5
The fifth column of the ls output is the file size in bytes. Don't add the -h flag here: human-readable sizes such as 1.2 G break the numeric sort.
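To see why column 5 is the right sort key, here is a sketch against fabricated hdfs dfs -ls lines (the paths, owner, and dates are made up; the size is the fifth field):

```shell
# Two fabricated `hdfs dfs -ls` lines; field 5 is the size in bytes.
# Replace the printf with the real command; the sort is the same.
printf '%s\n' \
  '-rw-r--r--   3 hdfs hadoop 1073741824 2020-01-01 00:00 /path/big' \
  '-rw-r--r--   3 hdfs hadoop  104857600 2020-01-01 00:00 /path/small' \
  | sort -r -n -k 5 | head -n 1
# prints the /path/big line, since it has the larger size
```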
Upvotes: 0
Reputation: 6343
You can make use of the hdfs fsck command to get the file sizes.
For example, when I execute hdfs fsck /tmp/ -files, I get the following output:
/tmp <dir>
/tmp/100GB <dir>
/tmp/100GB/Try <dir>
/tmp/100GB/Try/1.txt 5 bytes, 1 block(s): OK
/tmp/100GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/100GB/part-m-00000 107374182400 bytes, 800 block(s): OK
/tmp/100GB/part-m-00001._COPYING_ 44163923968 bytes, 330 block(s):
/tmp/10GB <dir>
/tmp/10GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/10GB/part-m-00000 10737418300 bytes, 81 block(s): OK
/tmp/1GB <dir>
/tmp/1GB/_SUCCESS 0 bytes, 0 block(s): OK
/tmp/1GB/part-m-00000 1073741900 bytes, 9 block(s): OK
/tmp/1GB/part-m-00001 1073741900 bytes, 9 block(s): OK
It recursively lists all the files under /tmp
along with their sizes.
Now, to parse out the file with max size, you can execute the following command:
hdfs fsck /tmp/ -files | grep "/tmp/" | grep -v "<dir>" | gawk '{print $2, $1;}' | sort -n
This command does the following:
hdfs fsck /tmp/ -files - runs an HDFS file system check on the folder /tmp/ and reports on each of the files under /tmp/
grep "/tmp/" - greps for /tmp/ (the folder we want to search); this gives only the files and folders under /tmp/
grep -v "<dir>" - removes the directories from the output (since we only want files)
gawk '{print $2, $1;}' - prints the file size ($2), followed by the file name ($1)
sort -n - does a numeric sort on the file size, so the last file in the list is the file with the largest size
You can pipe the output to tail -1 to get the largest file.
For example, I got this output:
107374182400 /tmp/100GB/part-m-00000
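The whole pipeline can be exercised against a captured fsck listing. The fabricated lines below mirror the fsck output format shown above, with tail -1 appended; plain awk behaves the same as gawk for this simple print:

```shell
# Three lines mimicking `hdfs fsck /tmp/ -files` output; swap the printf
# for the real command. awk is used here, but gawk behaves identically.
printf '%s\n' \
  '/tmp <dir>' \
  '/tmp/1GB/part-m-00000 1073741900 bytes, 9 block(s): OK' \
  '/tmp/100GB/part-m-00000 107374182400 bytes, 800 block(s): OK' \
  | grep "/tmp/" | grep -v "<dir>" \
  | awk '{print $2, $1;}' | sort -n | tail -1
# prints: 107374182400 /tmp/100GB/part-m-00000
```

Note the first grep drops the bare /tmp directory line (it contains "/tmp " but not "/tmp/"), and the second grep would drop any subdirectory lines that remain.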
Upvotes: 1