GNU parallel: parallel files by id

Question

I would like to parallelize a script. I know GNU parallel a little, but it may not be appropriate for my problem. I have several GFF input files (columns are separated by tabs) and I would like to process them in parallel with my script. All of the files contain the same ids.

File 1 :
id1 ...
id2 ...
id2 ...
id3 ...

File 2 :
id2 ...
id3 ...
id3 ...

The two files are different: the number of lines is not the same, and the ids come from the same set but are not necessarily present in every file. (I found the answer "How to make gnu-parallel split multiple input files", but there the number of lines is the same in all input files.) I do not want to concatenate the files, because I want to keep track of which data set each line comes from, and I do not want to change the GFF format.

For the moment, I am splitting my files by id and running my script on each piece. I need to keep all of id1 together (all of id2 together, etc.), but my script can take several ids at the same time. I do not need to run the combination File1 id1 - File2 id2, just File1 id1, File1 id2 - File2 id2, etc. Since an id sometimes has little data, it can be run together with other ids (run1: File1 id1, File1 id2 - File2 id2; run2: File1 id3 - File2 id3; etc.).

So is it possible to split my input data efficiently, by forming groups that depend on the id and the amount of data for each?

Thanks

Ole Tange · Accepted Answer

From your question it is really hard to understand what you are trying to do. If I got it wrong, please show us examples of what you expect to be run.

I assume your program reads from stdin and that you want the IDs grouped, so you get all the id1s in a single run and do not chop a group into multiple calls.

My suggestion is to merge File1 and File2, insert a marker before each ID group, let GNU Parallel read a block using the marker as record separator, remove the record separators and pass that to yourprog:

If File1+File2 are sorted:

sort -m File1.gff File2.gff |

If not:

sort File1.gff File2.gff |

Insert marker:

perl -pe '/^([^\t]+)/; if($1 ne $l) { print "Ma\nke\n"; } $l=$1;' |

Look for Ma\nke\n, split into 10MB blocks, remove markers, pass to yourprog:

parallel --pipe --recstart 'Ma\nke\n' --rrs --block 10M yourprog

Edit (20220918):

Today you would use --group-by.
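
As a rough sketch of what that could look like (not from the original answer, and not tested against real GFF files): combined with --pipe, --group-by can take a column number, and --colsep names the column separator, so lines sharing the value in column 1 should stay in the same record. The sample files are made up and wc -l stands in for yourprog. The sketch sorts first so that equal IDs are adjacent. Check the --group-by section of the man page for your GNU Parallel version before relying on it.

```shell
# Toy data (made up for illustration): tab-separated, first column is the ID.
demo=$(mktemp -d)
printf 'id1\ta\nid2\tb\nid2\tc\nid3\td\n' > "$demo/File1.gff"
printf 'id2\te\nid3\tf\nid3\tg\n' > "$demo/File2.gff"

# Sort so equal IDs are adjacent, then group records on column 1.
# wc -l stands in for yourprog; skipped if GNU Parallel is not installed.
if command -v parallel >/dev/null 2>&1; then
  sort "$demo/File1.gff" "$demo/File2.gff" |
    parallel --pipe --colsep '\t' --group-by 1 --block 10M wc -l > "$demo/counts.txt"
fi
```

No marker lines are needed in this variant, so nothing has to be stripped before the data reaches yourprog.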
