user1968963

Reputation: 2521

Bash: parallelize md5sum checksum on many files

Let's say I have a 64-core server, and I need to compute the md5sum of all files in /mnt/data and store the results in a text file:

find /mnt/data -type f -exec md5sum {} \; > md5.txt

The problem with the above command is that only one process runs at any given time. I would like to harness the full power of my 64 cores. Ideally, I would like to make sure that at any given time 64 parallel md5sum processes are running (but not more than 64).

Also, I need the output from all the processes to be stored in one file.

NOTE: I am not looking for a way to compute the md5sum of one file in parallel. I am looking for a way to compute 64 md5sums of 64 different files in parallel, for as long as find keeps producing files.

Upvotes: 30

Views: 16493

Answers (4)

Tony

Reputation: 1254

You can use xargs as well; it may be more readily available than parallel on some distros.

-P controls the number of processes spawned.

find /mnt/data -type f | xargs -L1 -P24 md5sum > /tmp/result.txt
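A minimal variant of the same idea, hedged for filenames containing spaces or newlines: find's -print0 paired with xargs -0 passes names null-delimited, -n1 hands each md5sum exactly one file, and -P64 matches the 64 cores in the question (bump -P to whatever your machine has):

```shell
# Null-delimited so odd filenames survive; up to 64 md5sum processes at once
find /mnt/data -type f -print0 | xargs -0 -n1 -P64 md5sum > /tmp/result.txt
```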

Upvotes: 18

TrueY

Reputation: 7610

UPDATED

If you do not want to use additional packages, you can try something like this:

#!/usr/bin/bash

max=5
cpid=()

# Enable job control so the shell receives SIGCHLD when a child exits
set -m

# On SIGCHLD, drop the first finished pid from the cpid array
remove() {
  for i in "${!cpid[@]}"; do
    [ ! -d "/proc/$i" ] && echo "UNSET $i" && unset "cpid[$i]" && break
  done
}
trap remove SIGCHLD

# NOTE: for/$(find) word-splits, so this breaks on filenames with whitespace
for x in $(find ./ -type f -name '*.sh'); do
  some_long_process "$x" &
  cpid[$!]=$x
  # Throttle: poll until a slot frees up
  while [ "${#cpid[@]}" -ge "$max" ]; do
    echo DO SOMETHING && sleep 1
  done
done
wait

It first enables job control so that the shell receives SIGCHLD when a subprocess exits. On SIGCHLD, the handler finds the first process that no longer exists and removes it from the cpid array.

In the for loop it starts up to max some_long_process processes asynchronously. Once max is reached, it polls the pids added to the cpid array and waits until the array's length drops below max before starting more processes asynchronously.

When the list is exhausted, it waits for all remaining children to finish.
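On bash 4.3 or newer, the same throttling can be sketched without the SIGCHLD trap by using wait -n, which blocks until any one background job exits (a sketch, not part of the answer above; md5sum replaces the some_long_process placeholder and max is set to the question's 64 cores):

```shell
#!/bin/bash
# Sketch, assuming bash 4.3+ (for wait -n): at most $max md5sum jobs at once
max=64
find /mnt/data -type f -print0 | {
  while IFS= read -r -d '' f; do
    md5sum "$f" &
    # When $max jobs are running, block until any one of them exits
    while [ "$(jobs -rp | wc -l)" -ge "$max" ]; do
      wait -n
    done
  done
  wait   # drain the remaining jobs before the redirection closes
} > md5.txt
```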

ADDED

Finally I have found a proper solution here.

Upvotes: 2

clt60

Reputation: 63974

If you want to experiment, try installing md5deep (http://md5deep.sourceforge.net).

Here is the manual, where you can read:

-jnn Controls multi-threading. By default the program will create one producer thread to scan the file system and one hashing thread per CPU core. Multi-threading causes output filenames to be in non-deterministic order, as files that take longer to hash will be delayed while they are hashed. If a deterministic order is required, specify -j0 to disable multi-threading

If this does not help, you have an I/O bottleneck.
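For concreteness, a sketch of the invocation (assuming md5deep is installed): -r recurses into the tree, and -j64 pins 64 hashing threads rather than relying on the one-per-core default.

```shell
# Recurse /mnt/data with 64 hashing threads (default is one per core)
md5deep -r -j64 /mnt/data > md5.txt
```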

Upvotes: 9

Steve

Reputation: 54592

Use GNU parallel. You can find some more examples of how to use it here.

find /mnt/data -type f | parallel -j 64 md5sum > md5.txt
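A hedged refinement of the same command: find's -print0 with parallel's -0 keeps filenames containing spaces or newlines intact, and parallel groups each job's output by default, so lines in md5.txt are never interleaved mid-line.

```shell
# Null-delimited filenames; 64 md5sum jobs at a time, output grouped per job
find /mnt/data -type f -print0 | parallel -0 -j 64 md5sum > md5.txt
```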

Upvotes: 35
