Reputation: 26006
I have to migrate a 20 TB file system with a couple of million files to a ZFS file system, so I would like to get an idea of the file-size distribution in order to choose a good block size.
My current idea is to run `stat --format="%s"` on each file and then divide the files into bins.
#!/bin/bash
A=0 # nr of files <= 2^10
B=0 # nr of files <= 2^11
C=0 # nr of files <= 2^12
D=0 # nr of files <= 2^13
E=0 # nr of files <= 2^14
F=0 # nr of files <= 2^15
G=0 # nr of files <= 2^16
H=0 # nr of files <= 2^17
I=0 # nr of files > 2^17
for f in $(find /bin -type f); do
    SIZE=$(stat --format="%s" $f)
    if [ $SIZE -le 1024 ]; then
        let $A++
    elif [ $SIZE -le 2048 ]; then
        let $B++
    elif [ $SIZE -le 4096 ]; then
        let $C++
    fi
done
echo $A
echo $B
echo $C
The problem with this script is that I can't get `find` to work inside a for-loop.
Question
How do I fix my script?
And is there a better way to get all the file sizes of a file system?
Upvotes: 1
Views: 372
Reputation: 26006
find /bin/ -type f -printf "%s\n" > /tmp/a
Then feed the result to the following script, invoked as `script.pl < /tmp/a`:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
# Bin counts keyed by each bin's upper bound in KiB ('big' for > 128 KiB).
my %h = ();
while (<STDIN>) {
    chomp;
    if    ($_ <= 2**10) { $h{1}   += 1 }
    elsif ($_ <= 2**11) { $h{2}   += 1 }
    elsif ($_ <= 2**12) { $h{4}   += 1 }
    elsif ($_ <= 2**13) { $h{8}   += 1 }
    elsif ($_ <= 2**14) { $h{16}  += 1 }
    elsif ($_ <= 2**15) { $h{32}  += 1 }
    elsif ($_ <= 2**16) { $h{64}  += 1 }
    elsif ($_ <= 2**17) { $h{128} += 1 }
    else                { $h{big} += 1 }
}
print Dumper \%h;
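For a one-shot run you can also skip the temporary file and pipe the sizes straight in, assuming the script above is saved as `script.pl` and made executable (or invoked as `perl script.pl`):

find /bin/ -type f -printf "%s\n" | ./script.pl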
Upvotes: 0
Reputation: 2337
I would investigate using dd to read the ZFS metadata, which should be contained on the data disks themselves.
That might be a bad suggestion and could waste your time, but crawling the file system with bash is going to take a long time and chew up system CPU.
Upvotes: 0
Reputation: 4236
If you just want to find the number of files between, say, 100M and 1000M, you can do the following:
find . -size +100M -size -1000M -type f | wc -l
Upvotes: 1
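Extending that idea to the power-of-two buckets from the question is straightforward, at the cost of one `find` pass per bucket. A hedged sketch (note that `-size +Nc` / `-size -Nc` mean strictly greater/less than N bytes, so `+0c` excludes empty files):

#!/bin/bash
# Count files in each power-of-two size bucket by running find once
# per bucket. Simple, but each pass re-walks the whole tree, so a
# single-pass approach scales better on a 20 TB file system.
lower=0
for exp in 10 11 12 13 14 15 16 17; do
    upper=$((2 ** exp))
    # size > lower and size < upper + 1, i.e. lower < size <= upper
    count=$(find /bin -type f -size +"${lower}c" -size -"$((upper + 1))c" | wc -l)
    echo "<= $upper bytes: $count"
    lower=$upper
done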
Reputation: 22261
The main problem is that you are using command substitution to feed the output of `find` to the `for` loop. Command substitution works by running the command within parentheses (or backticks) to completion, collecting its output, and substituting it into the script. That doesn't support streaming: the `for` loop won't run until the `find` scan is completely done, and you'll need lots of memory to buffer the output of `find` too.
Especially because you are scanning many terabytes' worth of files, you will want something that supports streaming, like a `while` loop:

find /bin -type f | while IFS= read -r f; do
    ...
done
With something that can stream, your script will at least work, but keep in mind that this technique forces you to invoke an external command (`stat`) once for each and every file found. That incurs a lot of process creation, destruction, and startup cost for the `stat` command. If you have GNU find, something that prints the size of each file right in the `find` command, with its `-printf` option for example, would perform much better.
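A minimal sketch of that combination, assuming GNU find. The sizes stream straight out of `find`, no per-file `stat` is spawned, and process substitution keeps the counters in the current shell (piping into `while` would run the loop in a subshell and discard them). The three bucket bounds here are just illustrative:

#!/bin/bash
# Stream one size per line from GNU find and bin the counts.
small=0 medium=0 large=0
while IFS= read -r size; do
    if   [ "$size" -le 1024 ];  then small=$((small + 1))
    elif [ "$size" -le 65536 ]; then medium=$((medium + 1))
    else                             large=$((large + 1))
    fi
done < <(find /bin -type f -printf '%s\n')
echo "<=1K: $small  <=64K: $medium  >64K: $large"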
Aside: the `let` statements in the body of the loop look wrong. You are expanding the contents of the `$A`, `$B`, and `$C` variables instead of referencing them. You shouldn't use `$` here.
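In other words, drop the dollar sign so that `let` sees the variable name:

let A++        # increments A; with A=0, "let $A++" expands to the invalid "let 0++"
A=$((A + 1))   # equivalent, and POSIX-friendly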
Upvotes: 2