Legend

Reputation: 116810

How to parallelize my bash script for use with `find` without facing race conditions?

I am trying to execute a command like this:

find ./ -name "*.gz" -print -exec ./extract.sh {} \;

The gz files themselves are small. Currently my extract.sh contains the following:

# Start delimiter
echo "#####" $1 >> Info
zcat $1 > temp
# Series of greps to extract some useful information
grep -o -P "..." temp >> Info
grep -o -P "..." temp >> Info
rm temp
echo "####" >> Info

Obviously, this is not parallelizable because if I run multiple extract.sh instances, they all write to the same file. What is a smart way of doing this?

I have 80K gz files on a machine with the massive horsepower of 32 cores.

Upvotes: 2

Views: 510

Answers (5)

tripleee

Reputation: 189377

The multiple grep invocations in extract.sh are probably the main bottleneck here. An obvious optimization is to read each file only once, then print a summary in the order you want. As an added benefit, we can hope that each report gets written out as a single block, though that might not prevent interleaved output completely. Still, here's my attempt.

#!/bin/sh

for f; do
    zcat "$f" |
    perl -ne '
        /(pattern1)/ && push @pat1, $1;
        /(pattern2)/ && push @pat2, $1;
        # ...
        END { print "##### '"$f"'\n";
            print join ("\n", @pat1), "\n";
            print join ("\n", @pat2), "\n";
            # ...
            print "#### '"$f"'\n"; }'
done

Doing this in awk instead of Perl might be slightly more efficient, but since you are using grep -P I figure it's useful to be able to keep the same regex syntax.
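For what it's worth, a rough awk equivalent of the loop body might look like the sketch below. It is only an illustration, and it assumes your patterns can be expressed as POSIX extended regular expressions (awk has no -P-style PCRE); pattern1 and pattern2 are placeholders just like in the Perl version.

zcat "$f" |
awk -v file="$f" '
    # Collect the matching substrings per pattern (a rough stand-in for grep -o)
    match($0, /pattern1/) { pat1[++n1] = substr($0, RSTART, RLENGTH) }
    match($0, /pattern2/) { pat2[++n2] = substr($0, RSTART, RLENGTH) }
    END {
        print "#####", file
        for (i = 1; i <= n1; i++) print pat1[i]
        for (i = 1; i <= n2; i++) print pat2[i]
        print "####", file
    }'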

The script accepts multiple .gz files as input, so you can hand it batches with find -exec extract.sh {} \+ or, to actually run several copies at once, with xargs and its parallel option. With xargs you can try to find a balance between sequential and parallel work by feeding each new process, say, 100 to 500 files in one batch: you save on the number of new processes, but lose some parallelism. Some experimentation should reveal where the balance lies, but this is the point where I would just pull a number out of my hat and see if it's good enough already.
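For instance, a hedged starting point with GNU find and xargs (the batch size of 200 and the 32 processes are just guesses to tune):

# 32 parallel copies of extract.sh, each handed 200 files per invocation
find ./ -name "*.gz" -print0 | xargs -0 -n 200 -P 32 ./extract.sh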

Granted, if your input files are small enough, the multiple grep invocations will be served from the disk cache and may turn out to be faster than the overhead of starting up Perl.

Upvotes: 0

Dunes

Reputation: 40703

I would create a temporary directory, then create an output file for each grep (based on the name of the file it processed). On many systems /tmp is a RAM-backed tmpfs, so these files will not thrash your hard drive with lots of writes.

You can then either cat it all together at the end, or get each grep to signal another process when it has finished and that process can begin catting files immediately (and removing them when done).

Example:

working_dir="`pwd`"
temp_dir="`mktemp -d`"
cd "$temp_dir"
find "$working_dir" -name "*.gz" -print0 | xargs -0 -P 32 -n 1 "$working_dir/extract.sh"
cat *.output > "$working_dir/Info"
cd "$working_dir"
rm -rf "$temp_dir"

extract.sh

 filename=$(basename "$1")
 output="$filename.output"
 extracted="$filename.extracted"
 zcat "$1" > "$extracted"

 echo "#####" "$filename" > "$output"
 # Series of greps to extract some useful information
 grep -o -P "..." "$extracted" >> "$output"
 grep -o -P "..." "$extracted" >> "$output"
 rm "$extracted"
 echo "####" >> "$output"

Upvotes: 0

Bartosz Moczulski

Reputation: 1239

You can use xargs to run your search in parallel. --max-procs limits the number of processes run simultaneously (the default is 1):

find ./ -name "*.gz" -print | xargs --max-args 1 --max-procs 32 ./extract.sh

In ./extract.sh you can use mktemp to write the data from each .gz file to its own temporary file; these can all be combined later:

# Start delimiter
tmp=`mktemp -t Info.XXXXXX`
src=$1
echo "#####" "$src" >> "$tmp"
zcat "$src" > "$tmp.unzip"
src="$tmp.unzip"

# Series of greps to extract some useful information
grep -o -P "..." "$src" >> "$tmp"
grep -o -P "..." "$src" >> "$tmp"
rm "$src"
echo "####" >> "$tmp"

If you have massive horsepower you can use zgrep directly, without unzipping first. But if you run several greps per file, it may be faster to zcat once.
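If you take the zgrep route, the unzip step and its temporary .unzip file drop out entirely; a minimal sketch under that assumption (zgrep hands the -o -P options through to grep):

# Start delimiter
tmp=`mktemp -t Info.XXXXXX`
echo "#####" "$1" >> "$tmp"

# Series of zgreps reading the compressed file directly
zgrep -o -P "..." "$1" >> "$tmp"
zgrep -o -P "..." "$1" >> "$tmp"
echo "####" >> "$tmp"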

Anyway, later combine everything into a single file:

cat /tmp/Info.* > Info
rm /tmp/Info.*

If you care about the order of the .gz files, feed ./extract.sh two arguments, a sequence number and the file name:

find files/ -name "*.gz" | nl -n rz | sed -e 's/\t/\n/' | xargs --max-args 2 ...

And in ./extract.sh:

tmp=`mktemp -t Info.$1.XXXXXX`
src=$2
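Putting that together, the two-argument ./extract.sh would look roughly like this sketch; because nl zero-pads the sequence numbers, the final cat /tmp/Info.* already concatenates the chunks in the original order:

# Start delimiter
tmp=`mktemp -t Info.$1.XXXXXX`
src=$2
echo "#####" "$src" >> "$tmp"
zcat "$src" > "$tmp.unzip"
src="$tmp.unzip"

# Series of greps to extract some useful information
grep -o -P "..." "$src" >> "$tmp"
grep -o -P "..." "$src" >> "$tmp"
rm "$src"
echo "####" >> "$tmp"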

Upvotes: 1

Spencer Rathbun

Reputation: 14900

A quick check through the findutils source reveals that find starts a child process for each exec. I believe it then moves on, though I may be misreading the source. Because of this you are already parallel, since the OS will handle sharing these out across your cores. And through the magic of virtual memory, the same executables will mostly share the same memory space.

The problem you are going to run into is file locking/data mixing. As each individual child runs, it pipes info into your Info file. These are individual script commands, so they will mix their output together like spaghetti. The fix below does not guarantee that the files' sections will appear in any particular order, only that each individual file's contents will stay together.

To solve this problem, all you need to do is take advantage of the shell's ability to create a temporary file (using mktemp, or the older tempfile utility), have each script dump its output to the temp file, then have each script cat the temp file into the Info file. Don't forget to delete the temp file after use.

If the temp files are in RAM (see tmpfs below), then you will avoid being I/O bound except when writing to your final file and when running the find search.

Tmpfs is a special file system that uses your RAM as "disk space". It will take up to the amount of RAM you allow, will not use more than it needs of that amount, and will swap to disk as needed if it does fill up.

To use:

  1. Create a mount point ( I like /mnt/ramdisk or /media/ramdisk )
  2. Edit /etc/fstab as root
  3. Add the line tmpfs /mnt/ramdisk tmpfs size=1G 0 0
  4. Run mount /mnt/ramdisk as root to mount your new ramdisk. It will also be mounted at boot.
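For example, as root (using the mount point and size assumed in the steps above):

mkdir -p /mnt/ramdisk
echo 'tmpfs /mnt/ramdisk tmpfs size=1G 0 0' >> /etc/fstab
mount /mnt/ramdisk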

See the Wikipedia entry on fstab for all the options available.
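Putting the pieces together, a hedged sketch of extract.sh along these lines; it assumes the /mnt/ramdisk mount from the steps above and otherwise mirrors the script in the question:

#!/bin/sh
# Build the report in a per-file temp file on the ramdisk,
# then append it to the shared Info file in one go.
tmp=$(mktemp /mnt/ramdisk/Info.XXXXXX)

echo "#####" "$1" >> "$tmp"
zcat "$1" > "$tmp.unzip"
# Series of greps to extract some useful information
grep -o -P "..." "$tmp.unzip" >> "$tmp"
grep -o -P "..." "$tmp.unzip" >> "$tmp"
echo "####" >> "$tmp"

cat "$tmp" >> Info
rm -f "$tmp" "$tmp.unzip"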

Upvotes: 1

stefan bachert

Reputation: 9608

Assume (just for simplicity and clarity) that all your files start with a-z.

So you could use up to 26 cores in parallel by launching a find sequence like the one above for each letter. Each "find" needs to generate its own aggregate file:

find ./ -name "a*.gz" -print -exec ./extract.sh a {} \; &
find ./ -name "b*.gz" -print -exec ./extract.sh b {} \; &
..
find ./ -name "z*.gz" -print -exec ./extract.sh z {} \;

(extract.sh needs to use its first parameter to pick the separate "Info" destination file, as sketched below)
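A minimal sketch of extract.sh adapted that way (the names Info.$letter and temp.$letter are illustrative):

#!/bin/sh
letter=$1                 # first parameter: selects the per-letter files
out="Info.$letter"
tmp="temp.$letter"        # separate temp file per letter avoids clashes

echo "#####" "$2" >> "$out"
zcat "$2" > "$tmp"
grep -o -P "..." "$tmp" >> "$out"
grep -o -P "..." "$tmp" >> "$out"
rm "$tmp"
echo "####" >> "$out"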

When you want one big aggregate file, just join all the aggregate files (e.g. cat Info.* > Info).

However, I am not convinced this approach gains much performance. In the end all the file content gets serialized anyway.

Probably hard disk head movement will be the limitation, not the unzip (CPU) performance.

But let's try.

Upvotes: 1
