yuan0122

Reputation: 437

How to find the max file size in a hdfs directory

I want to find the max size of the files in an HDFS directory. Does anyone have any idea how to find it? I'm on Hadoop 2.6.0.

I found hadoop fs -ls -S /url, which sorts output by file size, in the Hadoop 2.7.0 documentation, but it's not supported in 2.6.0. So is there a similar function that can sort output files by size? Thank you!

Upvotes: 0

Views: 5297

Answers (3)

MukeshKoshyM

Reputation: 544

Please try the command below.

hadoop fs -du Folder | sort -n -r | head -n 1
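Here, hadoop fs -du prints one "&lt;size in bytes&gt; &lt;path&gt;" line per entry in the directory, sort -n -r orders those lines numerically with the largest first, and head -n 1 keeps only the top entry. A minimal local sketch of the sorting stage, simulating the -du output with printf (the sizes and paths are made up):

```shell
# Simulated `hadoop fs -du Folder` output: one "<size in bytes> <path>" line per entry
printf '%s\n' \
  '1073741900 /data/part-m-00000' \
  '5 /data/1.txt' \
  '107374182400 /data/part-m-00001' |
sort -n -r |   # numeric sort, largest size first
head -n 1      # keep only the largest entry
```

This prints 107374182400 /data/part-m-00001, i.e. the size and path of the biggest file.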

Upvotes: 0

BruceWayne

Reputation: 3374

Try this to find the largest file:

hdfs dfs -ls /path | sort -r -n -k 5

Note: avoid the -h flag here; human-readable sizes such as 1.5 G don't sort correctly with sort -n.
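In the -ls listing the fifth column is the file size in bytes, so sorting numerically on that column puts the largest file first. A local sketch of the sort stage, simulating the hdfs dfs -ls rows (the paths, sizes, and dates are made up):

```shell
# Simulated `hdfs dfs -ls /path` rows:
# perms  repl  owner  group  size  date  time  path
printf '%s\n' \
  '-rw-r--r--   3 hdfs hdfs 5 2016-01-01 10:00 /path/1.txt' \
  '-rw-r--r--   3 hdfs hdfs 107374182400 2016-01-01 10:01 /path/part-m-00000' \
  '-rw-r--r--   3 hdfs hdfs 1073741900 2016-01-01 10:02 /path/part-m-00001' |
sort -r -n -k 5 |   # numeric sort on the 5th column (file size), largest first
head -n 1           # the largest file's row
```

This prints the /path/part-m-00000 row, whose size column is the largest.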

Upvotes: 0

Manjunath Ballur

Reputation: 6343

You can make use of the hdfs fsck command to get the file sizes.

For example, when I execute hdfs fsck /tmp/ -files, I get the following output:

/tmp <dir>
/tmp/100GB <dir>
/tmp/100GB/Try <dir>
/tmp/100GB/Try/1.txt 5 bytes, 1 block(s):  OK
/tmp/100GB/_SUCCESS 0 bytes, 0 block(s):  OK
/tmp/100GB/part-m-00000 107374182400 bytes, 800 block(s):  OK
/tmp/100GB/part-m-00001._COPYING_ 44163923968 bytes, 330 block(s):
/tmp/10GB <dir>
/tmp/10GB/_SUCCESS 0 bytes, 0 block(s):  OK
/tmp/10GB/part-m-00000 10737418300 bytes, 81 block(s):  OK
/tmp/1GB <dir>
/tmp/1GB/_SUCCESS 0 bytes, 0 block(s):  OK
/tmp/1GB/part-m-00000 1073741900 bytes, 9 block(s):  OK
/tmp/1GB/part-m-00001 1073741900 bytes, 9 block(s):  OK

It recursively lists all the files under /tmp along with their sizes.

Now, to parse out the file with the max size, you can execute the following command:

hdfs fsck /tmp/ -files | grep "/tmp/" | grep -v "<dir>" | gawk '{print $2, $1;}'  | sort -n 

This command does the following:

  • hdfs fsck /tmp/ -files - runs an HDFS file system check on the folder /tmp/ and reports on each of the files under /tmp/
  • grep "/tmp/" - greps for /tmp/ (the folder we want to search), keeping only the files and folders under /tmp/
  • grep -v "<dir>" - removes the directories from the output (since we only want files)
  • gawk '{print $2, $1;}' - prints the file size ($2), followed by the file name ($1)
  • sort -n - does a numeric sort on the file size; the last file in the list is the one with the largest size

You can pipe the output to tail -1 to get the largest file.
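For instance, simulating a few lines of the fsck output with printf (the paths and sizes below are made up, and plain awk is used in place of gawk, which behaves the same for this one-liner), the whole pipeline with tail -1 appended looks like:

```shell
# Simulated `hdfs fsck /tmp/ -files` output
printf '%s\n' \
  '/tmp <dir>' \
  '/tmp/1GB/part-m-00000 1073741900 bytes, 9 block(s):  OK' \
  '/tmp/100GB/part-m-00000 107374182400 bytes, 800 block(s):  OK' \
  '/tmp/100GB/_SUCCESS 0 bytes, 0 block(s):  OK' |
grep "/tmp/" |             # keep only entries under /tmp/
grep -v "<dir>" |          # drop directory entries
awk '{print $2, $1;}' |    # size first, then file name
sort -n |                  # numeric sort by size, ascending
tail -1                    # the largest file
```

This prints 107374182400 /tmp/100GB/part-m-00000.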

For example, I got this output:

107374182400 /tmp/100GB/part-m-00000 

Upvotes: 1
