Reputation: 89
I have a list containing >100 tab-delimited files, containing 5-8 million rows, and 16 columns (always in the same exact order). From each file I need to extract 5 specific columns, including one identifier-column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:
where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.
My problem is that the input files contain partially overlapping IDs and I need to extract the union of these rows (i.e. all IDs that occur at least once in one of the input files). To be more exact, output1 should contain the unions of the "ID"- and "VAR1"-columns of all input files. The row order of the remaining output files should be identical to output1. Finally, rows not present in any given input file should be padded with "NA" in output2, output3 and output4.
I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.
My script so far:
ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
while read FILE;do
sort -k${ID},${ID} < ${FILE} | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} 'BEGIN{OFS="\t"};{print $ID,$VAR1 > "tmp1";print ${ID},$VAR2 > "tmp2";print ${ID},$VAR3 > "tmp3";print ${ID},$VAR4 > "tmp4"}'
awk 'FNR==NR{a[$1]=$1;next};{if(($1 in a)==0){print $0 > "tmp5"}}' output1 tmp1
cat output1 tmp5 > foo && mv foo output1
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar2 output3
join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar2 output4
rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1
Final remark: I use cat
for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when I pre-process my files). So I can just append rows that are not already included to the bottom of output1 and sort the final output-file
Upvotes: 0
Views: 67
Reputation: 2239
First you have to figure out where most of the time is lost. You can 'echo "running X"; time ./X` and make sure you are not trying to optimize the fastest part of the script.
You can simply run the three joins in background in parallel (cmd args ) &
and then wait
for all of them to finish. If this takes 1 second and the awk
part before takes 10 minutes then this will not help a lot.
You can also put the wait
before cat output 1 tmp5...
and before the final sort -k1...
line. For this to work you'll have to name the temporary files differently and rename them just before the join
s. The idea is to generate the input for the three parallel joins for the first file in background, wait
, then rename the files, run the join
s in background and generate the next inputs. After the loop is complete just wait the last join
s to finish. This will help if the awk
part consumes comparable to the join
s CPU time.
HTH, you can make even more complex parallel execution scenarios.
Upvotes: 1