Reputation: 13
I would like to parallelize a script. I know a bit of GNU Parallel, but it may not be the right tool for my problem. I have several GFF input files (columns are separated by tabs) and I would like to process them in parallel with my script. The files share the same set of ids.
File 1 :
id1 ...
id2 ...
id2 ...
id3 ...
File 2 :
id2 ...
id3 ...
id3 ...
The two files differ: they do not have the same number of lines, and although they use the same ids, not every id is present in every file. (I found the answer "How to make gnu-parallel split multiple input files", but there the number of lines is the same in all input files.) I do not want to concatenate the files, because I want to keep track of which data set each line comes from, and I do not want to change the GFF format. For the moment, I split my files by id and run my script on each piece. All lines for id1 must stay together (likewise id2, etc.), but my script can take several ids at once. I do not need to run cross-id combinations such as File1 id1 with File2 id2; I only need File1 id1, then File1 id2 with File2 id2, and so on. Since some ids have very little data, they can be run together with other ids (run 1: File1 id1, File1 id2 with File2 id2; run 2: File1 id3 with File2 id3; etc.). So is it possible to split my input data efficiently, forming groups based on the id and the amount of data for each id?
Thanks
Upvotes: 1
Views: 384
Reputation: 33685
From your question it is really hard to understand what you are trying to do. If I got it wrong, please show us examples of what you expect to be run.
I assume your program reads from stdin and that you want the IDs grouped, so you get all the id1s in a single run and do not chop a group into multiple calls.
My suggestion is to merge File1 and File2, insert a marker before each ID group, let GNU Parallel read a block using the marker as record separator, remove the record separators, and pass the result to yourprog:
If File1+File2 are sorted:
sort -m File1.gff File2.gff |
If not:
sort File1.gff File2.gff |
Insert marker:
perl -pe '/^([^\t]+)/; if($1 ne $l) { print "Ma\rke\r"; } $l=$1;' |
Look for Ma\rke\r, split into 10MB blocks, remove markers, pass to yourprog:
parallel --pipe --recstart 'Ma\rke\r' --rrs --block 10M yourprog
Edit (20220918):
Today you would use --group-by.
Upvotes: 1
Reputation: 33685
Since 20190222 you can use --shard:
cat *gff | parallel --shard 1 -j8 yourprog
This will look at column 1 of each line, compute a hash of it, and send the line to one of the 8 instances of yourprog depending on the hash value modulo 8, so all lines with the same ID go to the same instance.
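The idea of hashing a key and taking it modulo the number of jobs can be illustrated without GNU Parallel. The sketch below is an assumption-laden toy: `shard_of` is a hypothetical helper, and it uses the POSIX `cksum` CRC, which is not necessarily the hash GNU Parallel uses internally. It only shows that equal IDs always map to the same slot.

```shell
# Hypothetical helper: map a key to a shard number in 0..n-1.
# Uses cksum (CRC) as a stand-in hash; GNU Parallel's own hash may differ.
shard_of() {
  # $1 = key (e.g. the value of column 1), $2 = number of shards
  printf '%s' "$1" | cksum | awk -v n="$2" '{ print $1 % n }'
}

# Toy input: print the shard number next to each ID.
printf 'id1\ta\nid2\tb\nid2\tc\nid3\td\n' |
while IFS="$(printf '\t')" read -r key rest; do
  printf '%s\t%s\n' "$(shard_of "$key" 8)" "$key"
done
```

Both id2 lines get the same shard number, which is the property that keeps an ID group together in one instance. Note that several different IDs can still land in the same shard, and shard sizes depend on how the data is distributed over the IDs.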
Upvotes: 1