Reputation: 2521
Let's say I have a 64-core server, and I need to compute the md5sum of all files in /mnt/data and store the results in a text file:
find /mnt/data -type f -exec md5sum {} \; > md5.txt
The problem with the above command is that only one process runs at any given time. I would like to harness the full power of my 64 cores. Ideally, I would like to make sure that at any given time 64 parallel md5sum processes are running (but no more than 64).
Also, I need the output from all the processes to be stored in one file.
NOTE: I am not looking for a way to compute the md5sum of one file in parallel. I am looking for a way to compute 64 md5sums of 64 different files in parallel, as long as there are any files coming from find.
Upvotes: 30
Views: 16493
Reputation: 1254
You can use xargs as well. It may be more readily available than GNU parallel on some distros.
-P controls the number of processes spawned.
find /mnt/data -type f | xargs -L1 -P24 md5sum > /tmp/result.txt
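Filenames containing spaces or newlines break the pipeline above, because xargs splits its input on whitespace. A minimal sketch of a NUL-delimited variant, assuming your find and xargs support -print0, -0 and -P (GNU and BSD versions do); the function name hash_tree and its arguments are hypothetical, not part of the original answer:

```shell
# hash_tree DIR [JOBS]: hypothetical helper for illustration.
# Prints one md5sum line per regular file under DIR, running up to
# JOBS md5sum processes at a time (default 4).
hash_tree() {
    # -print0 / -0 pass filenames NUL-terminated, so spaces and newlines are safe.
    # -n 1 hands exactly one file to each md5sum invocation.
    find "$1" -type f -print0 | xargs -0 -n 1 -P "${2:-4}" md5sum
}
```

Usage would look like `hash_tree /mnt/data 64 > md5.txt`. Lines arrive in completion order rather than in find order, so pipe the result through sort if a stable order matters.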
Upvotes: 18
Reputation: 7610
UPDATED
If you do not want to use additional packages, you can try something like this:
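#!/usr/bin/bash
max=5
cpid=()
# Enable job control to receive SIGCHLD
set -m
# On SIGCHLD, drop the first PID that no longer exists from cpid
remove() {
    for i in "${!cpid[@]}"; do
        [ ! -d "/proc/$i" ] && echo "UNSET $i" && unset "cpid[$i]" && break
    done
}
trap remove SIGCHLD
# Note: $(find ...) word-splits, so filenames containing whitespace will break
for x in $(find ./ -type f -name '*.sh'); do
    some_long_process "$x" &
    cpid[$!]="$x"
    # Poll until a slot frees up
    while [ "${#cpid[@]}" -ge "$max" ]; do
        echo DO SOMETHING && sleep 1
    done
done
# Wait for the remaining children
wait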
It first enables job control so that SIGCHLD is received when a subprocess exits. On SIGCHLD, it finds the first process that no longer exists and removes it from the cpid array.
In the for loop it starts up to max some_long_process processes asynchronously. Once max is reached, it polls until the length of cpid is less than max, then starts more processes asynchronously.
When the list is exhausted, it waits for all children to finish.
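On Bash 4.3 or later, `wait -n` offers a simpler way to cap the number of background jobs than polling /proc. This is a different technique from the SIGCHLD trap above, sketched under the assumption that a recent enough Bash is available; the function name run_limited is hypothetical:

```shell
#!/usr/bin/env bash
# run_limited MAX CMD FILE...: hypothetical helper (assumes bash >= 4.3 for `wait -n`).
# Runs CMD once per FILE, keeping at most MAX jobs in flight.
run_limited() {
    local max=$1 cmd=$2 running=0 f
    shift 2
    for f in "$@"; do
        if [ "$running" -ge "$max" ]; then
            wait -n || true            # block until any one background job exits
            running=$((running - 1))
        fi
        "$cmd" "$f" &
        running=$((running + 1))
    done
    wait                               # wait for the remaining jobs
}
```

For example, `run_limited 5 some_long_process ./*.sh` mirrors the loop above for files in the current directory only, without the trap or the sleep-based polling.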
ADDED
Finally, I found a proper make-based solution here.
Upvotes: 2
Reputation: 63974
If you want to experiment, try installing md5deep (http://md5deep.sourceforge.net).
Here is what the manual says:
-jnn Controls multi-threading. By default the program will create one producer thread to scan the file system and one hashing thread per CPU core. Multi-threading causes output filenames to be in non-deterministic order, as files that take longer to hash will be delayed while they are hashed. If a deterministic order is required, specify -j0 to disable multi-threading
If this does not help, you have an I/O bottleneck.
Upvotes: 9
Reputation: 54592
Use GNU parallel. You can find more examples of how to use it here.
find /mnt/data -type f | parallel -j 64 md5sum > md5.txt
Upvotes: 35