Sandra Schlichting

Reputation: 26006

Getting file size of each file on a very large file system

I have to move a 20TB file system with a couple of million files to a ZFS file system, so I would like to get an idea of the file sizes in order to choose a good block size.

My current idea is to run `stat --format="%s"` on each file and then divide the files into bins.

#!/bin/bash

A=0 # nr of files <= 2^10
B=0 # nr of files <= 2^11
C=0 # nr of files <= 2^12
D=0 # nr of files <= 2^13
E=0 # nr of files <= 2^14
F=0 # nr of files <= 2^15
G=0 # nr of files <= 2^16
H=0 # nr of files <= 2^17
I=0 # nr of files >  2^17

for f in $(find /bin -type f); do

    SIZE=$(stat --format="%s" $f)

    if [ $SIZE -le 1024 ]; then
        let $A++
    elif [ $SIZE -le 2048 ]; then
        let $B++
    elif [ $SIZE -le 4096 ]; then
        let $C++
    fi
done

echo $A
echo $B
echo $C

The problem with this script is that I can't get find to work inside a for-loop.

Question

How do I fix my script?

And is there a better way to get all the file sizes on a file system?

Upvotes: 1

Views: 372

Answers (5)

msw

Reputation: 43507

The venerable du command would provide you with sizes more directly.
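
For example, with GNU du (a sketch; -a lists every file rather than only directories, and -b reports the apparent size in bytes rather than disk usage):

du -ab /bin

The first column can then be binned the same way as the stat output.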

Upvotes: 0

Sandra Schlichting

Reputation: 26006

find /bin/ -type f -printf "%s\n" > /tmp/a

And then run the following script as `script.pl < /tmp/a`.

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;

my %h = ();
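# keys are the bin's upper bound in KiB; 'big' counts files larger than 2^17 bytes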

while (<STDIN>) {
    chomp;
    if    ($_ <= 2**10) { $h{1} += 1}
    elsif ($_ <= 2**11) { $h{2} += 1}
    elsif ($_ <= 2**12) { $h{4} += 1}
    elsif ($_ <= 2**13) { $h{8} += 1}
    elsif ($_ <= 2**14) { $h{16} += 1}
    elsif ($_ <= 2**15) { $h{32} += 1}
    elsif ($_ <= 2**16) { $h{64} += 1}
    elsif ($_ <= 2**17) { $h{128} += 1}
    elsif ($_ >  2**17) { $h{big} += 1}
}

print Dumper \%h;

Upvotes: 0

Lurk21

Reputation: 2337

I would investigate using dd to read the zfs metadata, which should be contained on the data disks themselves.

That might be a bad suggestion and could result in you wasting time, but crawling the file system with bash is going to take a long time and chew up a lot of CPU.

Upvotes: 0

jhole

Reputation: 4236

If you just want to find out the number of files between, say, 100M and 1000M, you can do the following:

find . -size +100M -size -1000M  -type f | wc -l
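
Along the same lines, here is a sketch that counts the files at or below each power-of-two threshold (it assumes bash and a find that accepts the c suffix for exact byte sizes, as GNU find does). Note that it rescans the tree once per threshold, so for a 20TB file system a single pass, as in the other answers, will be much cheaper:

for p in 10 11 12 13 14 15 16 17; do
    echo -n "files <= 2^$p bytes: "
    find . -type f -size -"$(( (1 << p) + 1 ))c" | wc -l
done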

Upvotes: 1

Celada

Reputation: 22261

The main problem is that you are using command substitution to feed the output of find to the for loop. Command substitution works by running the command within parentheses (or backticks) to completion, collecting its output, and substituting it into the script. That doesn't support streaming: the for loop won't run until the find scan is completely done, and you'll need lots of memory to buffer the output of find too.

Especially because you are scanning many terabytes worth of files, you will want to use something that supports streaming, like a while loop:

find /bin -type f | while read f; do
    ...
done

With something that can stream, your script will at least work, but keep in mind that this technique forces you to invoke an external command (stat) once for each and every file that is found. That incurs a lot of process creation, destruction, and startup cost for the stat command. If you have GNU find, printing the size of each file directly from the find command with its -printf option, for example, would perform much better.
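
For instance, a minimal sketch of that approach (it assumes bash and GNU find; the loop reads from process substitution rather than a pipeline so that the counters survive into the parent shell):

#!/bin/bash

A=0 B=0 C=0

while read -r SIZE; do
    if   [ "$SIZE" -le 1024 ]; then A=$(( A + 1 ))
    elif [ "$SIZE" -le 2048 ]; then B=$(( B + 1 ))
    elif [ "$SIZE" -le 4096 ]; then C=$(( C + 1 ))
    fi
done < <(find /bin -type f -printf '%s\n')

echo "$A $B $C"

No stat is invoked at all, because find prints the sizes itself.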

Aside: the let statements in the body of the loop look wrong. You are expanding the $A, $B, and $C variables to their values instead of giving let the variable names; you shouldn't use the $ here.

Upvotes: 2
