Hill

Reputation: 89

Efficient join of >100 files

I have a list of >100 tab-delimited files, each containing 5-8 million rows and 16 columns (always in the same exact order). From each file I need to extract 5 specific columns, including one identifier column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:

output1: ID, VAR1
output2: ID, VAR2.1, VAR2.2, VAR2.3
output3: ID, VAR3.1, VAR3.2, VAR3.3
output4: ID, VAR4.1, VAR4.2, VAR4.3

where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.

My problem is that the input files contain partially overlapping IDs, and I need to extract the union of these rows (i.e. all IDs that occur at least once in any of the input files). To be more exact, output1 should contain the union of the "ID" and "VAR1" columns of all input files. The row order of the remaining output files should be identical to output1's. Finally, rows missing from any given input file should be padded with "NA" in output2, output3 and output4, as in the toy example below.
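To make the NA-padding concrete, a hypothetical toy case: if input file 1 contains IDs A and B, and input file 2 contains IDs B and C (with made-up VAR2 values), output2 would hold the union of the IDs:

ID  VAR2.1  VAR2.2
A   7       NA
B   3       5
C   NA      2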

I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.

My script so far:

ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
touch output1 output2 output3 output4  # the loop expects the output files to exist on the first pass
while read FILE; do
    # Extract the ID column plus one VAR column into each sorted tmp file.
    sort -k${ID},${ID} < "${FILE}" | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} 'BEGIN{OFS="\t"} {print $ID,$VAR1 > "tmp1"; print $ID,$VAR2 > "tmp2"; print $ID,$VAR3 > "tmp3"; print $ID,$VAR4 > "tmp4"}'
    # Collect the rows whose IDs are not yet in output1 (the FILENAME test,
    # unlike FNR==NR, also works while output1 is still empty).
    touch tmp5
    awk 'FILENAME=="output1"{a[$1];next} !($1 in a){print $0 > "tmp5"}' output1 tmp1
    cat output1 tmp5 > foo && mv foo output1
    join -e "NA" -a1 -a2 -t $'\t' -o auto -1 1 -2 1 output2 tmp2 > bar2 && mv bar2 output2
    join -e "NA" -a1 -a2 -t $'\t' -o auto -1 1 -2 1 output3 tmp3 > bar3 && mv bar3 output3
    join -e "NA" -a1 -a2 -t $'\t' -o auto -1 1 -2 1 output4 tmp4 > bar4 && mv bar4 output4
    rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1

Final remark: I use cat for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when pre-processing my files). So I can just append the rows that are not already included to the bottom of output1 and sort the final output file.

Upvotes: 0

Views: 67

Answers (1)

Doncho Gunchev

Reputation: 2239

First you have to figure out where most of the time is lost. You can wrap each stage in `echo "running X"; time ./X` and make sure you are not trying to optimize the fastest part of the script.
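For example, using the commands from the question (here `somefile` stands in for one input file):

# Time each stage of the loop body separately before optimizing anything.
echo "running sort+awk"
time sort -k1,1 somefile | awk 'BEGIN{OFS="\t"}{print $1,$6 > "tmp1"}'
echo "running join"
time join -e "NA" -a1 -a2 -t $'\t' -o auto output2 tmp2 > bar2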

You can simply run the three joins in the background in parallel, `( cmd args ) &`, and then `wait` for all of them to finish. If this takes 1 second and the awk part before it takes 10 minutes, then this will not help a lot.
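A minimal sketch of that, reusing the join commands from the question:

# Run the three independent joins concurrently, then wait for all of them
# before renaming the results.
join -e "NA" -a1 -a2 -t $'\t' -o auto output2 tmp2 > bar2 &
join -e "NA" -a1 -a2 -t $'\t' -o auto output3 tmp3 > bar3 &
join -e "NA" -a1 -a2 -t $'\t' -o auto output4 tmp4 > bar4 &
wait
mv bar2 output2 && mv bar3 output3 && mv bar4 output4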

You can also put the `wait` before the `cat output1 tmp5 ...` line and before the final `sort -k1,1 ...` line. For this to work you'll have to name the temporary files differently and rename them just before the joins. The idea is to generate the input for the three parallel joins for the first file in the background, wait, then rename the files, run the joins in the background, and generate the next inputs. After the loop is complete, just wait for the last joins to finish. This will help if the awk part consumes CPU time comparable to the joins.
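A sketch of one way to wire this up, assuming the same file names and column positions as in the question (the `extract` and `merge` helper names are made up for this example):

# extract: write the next round's column files (new.tmp1..new.tmp4) from one input file.
extract() {
    sort -k1,1 "$1" | awk 'BEGIN{OFS="\t"}
        {print $1,$6  > "new.tmp1"; print $1,$9  > "new.tmp2";
         print $1,$12 > "new.tmp3"; print $1,$16 > "new.tmp4"}'
}

# merge: one round of the loop body from the question, with the joins in parallel.
merge() {
    touch tmp5
    awk 'FILENAME=="output1"{a[$1];next} !($1 in a){print $0 > "tmp5"}' output1 tmp1
    cat output1 tmp5 > foo && mv foo output1
    join -e "NA" -a1 -a2 -t $'\t' -o auto output2 tmp2 > bar2 & J2=$!
    join -e "NA" -a1 -a2 -t $'\t' -o auto output3 tmp3 > bar3 & J3=$!
    join -e "NA" -a1 -a2 -t $'\t' -o auto output4 tmp4 > bar4 & J4=$!
    wait $J2 $J3 $J4
    mv bar2 output2; mv bar3 output3; mv bar4 output4
    rm tmp?
}

touch output1 output2 output3 output4
while read FILE; do
    extract "$FILE" & E=$!     # prepare the next round's input in the background...
    [ -e tmp1 ] && merge       # ...while merging the previous round in the foreground
    wait $E
    mv new.tmp1 tmp1; mv new.tmp2 tmp2; mv new.tmp3 tmp3; mv new.tmp4 tmp4
done < files.list
merge                          # the last round's tmp files are still pending
sort -k1,1 output1 > foo && mv foo output1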

HTH. You can build even more complex parallel execution scenarios along these lines.

Upvotes: 1
