cyrusjan

Reputation: 647

Conflict between GNU parallel and awk (split a column and filter some rows)

I am working on many large gz files like the example below (only the first 5 rows are shown here).

gene_id variant_id  tss_distance    ma_samples  ma_count    maf pval_nominal    slope   slope_se
ENSG00000223972.4   1_13417_C_CGAGA_b37 1548    50  50  0.0766871   0.735446    -0.0468165  0.138428
ENSG00000223972.4   1_17559_G_C_b37 5690    7   7   0.00964187  0.39765 -0.287573   0.339508
ENSG00000223972.4   1_54421_A_G_b37 42552   28  28  0.039548    0.680357    0.0741142   0.179725
ENSG00000223972.4   1_54490_G_A_b37 42621   112 120 0.176471    0.00824733  0.247533    0.093081

Below is the output that I want.

Here, I split the second column on "_" and keep only the rows where, after splitting, the second field equals 1 and the third field is greater than 20000 ($2==1 && $3>20000), then save the result as a txt file. The command below works perfectly; its output is shown after it.

zcat InputData.txt.gz | awk -F "_"  '$1=$1' | awk '{if ($2==1 && $3>20000) {print}}'  > OutputData.txt

ENSG00000223972.4   1 54421 A G b37 42552   28  28  0.039548    0.680357    0.0741142   0.179725
ENSG00000223972.4   1 54490 G A b37 42621   112 120 0.176471    0.00824733  0.247533    0.093081
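
For reference, the first awk re-splits each record on "_", and the always-true assignment $1=$1 forces awk to rebuild the record, rejoining the fields with the default output separator (a single space); the second awk then filters on the rebuilt fields. An annotated version of the same pipeline:

# stage 1: re-split on "_"; $1=$1 rebuilds $0 with single spaces as separators
# stage 2: keep rows whose 2nd field is 1 and whose 3rd field exceeds 20000
zcat InputData.txt.gz |
  awk -F "_" '$1=$1' |
  awk '{if ($2==1 && $3>20000) {print}}' > OutputData.txt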

But I want to use GNU parallel to speed up the process, since I have many large gz files to work with. However, there seems to be a conflict between GNU parallel and awk, probably related to quoting.

I tried defining the awk scripts separately, as below, but it did not write anything to the output file.

In the command below, I am running parallel on only one input file, but I want to run it on multiple input files and save one output file per input file.

For example,

InputData_1.txt.gz to OutputData_1.txt

InputData_2.txt.gz to OutputData_2.txt

awk1='{ -F "_"  "$1=$1" }'
awk2='{if ($2==1 && $3>20000) {print}}' 
parallel "zcat {} | awk '$awk1' |awk '$awk2' > OutputData.txt" ::: InputData.txt.gz

Does anyone have any suggestion on this task? Thank you very much.


Following the suggestion from @karakfa, here is one solution:

chr=1
RegionStart=10000
RegionEnd=50000
zcat InputData.txt.gz | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("OutputData.txt")}' 

# This also works using parallel:

awkbody='{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("{}_OutputData.txt")}'
parallel "zcat {} | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '$awkbody' " ::: InputData_*.txt.gz

The output file for the input file InputData_1.txt.gz will be named InputData_1.txt.gz_OutputData.txt.
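
If you would rather have InputData_1.txt.gz map to OutputData_1.txt, one option is to build the output name on the parallel side with a {= ... =} perl-expression replacement string and hand it to awk as a variable (a sketch; the variable name out is illustrative):

awkbody='{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > out}'
parallel "zcat {} | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd -v out={=s/InputData/OutputData/; s/\.gz\$//=} '$awkbody' " ::: InputData_*.txt.gz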

Upvotes: 1

Views: 324

Answers (3)

Ole Tange

Reputation: 33740

https://www.gnu.org/software/parallel/man.html#QUOTING concludes:

Conclusion: To avoid dealing with the quoting problems it may be easier just to write a small script or a function (remember to export -f the function) and have GNU parallel call that.

So:

doit() {
  zcat "$1" |
    awk -F "_"  '$1=$1' |
    awk '{if ($2==1 && $3>20000) {print}}'
}
export -f doit
parallel 'doit {} > {=s/In/Out/; s/.gz//=}' ::: InputData*.txt.gz
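
The {= ... =} construct is a replacement string holding a Perl expression that GNU parallel applies to the input argument ($_), so InputData_1.txt.gz becomes OutputData_1.txt (In is replaced by Out, then the .gz suffix is removed). You can preview the generated commands with --dry-run:

parallel --dry-run 'doit {} > {=s/In/Out/; s/.gz//=}' ::: InputData_1.txt.gz
# prints: doit InputData_1.txt.gz > OutputData_1.txt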

Upvotes: 2

Dudi Boy

Reputation: 4900

The simple solution is to combine the filters into a single awk script; then, and only then, can parallel work.

Here is a sample solution that scans the whole input.txt only once (twice the performance):

awk 'BEGIN{FS="[ ]*[_]?"}$2==1 && $7 > 20000 {print}' input.txt

Explanation:

BEGIN{FS="[ ]*[_]?"} sets the field separator to a run of spaces optionally followed by an underscore, so both " " and "_" act as separators

$2==1 && $7 > 20000 {print} Print only lines with 2nd field == 1 and 7th field > 20000

Sample debug script:

BEGIN{FS="[ ]*[_]?"}
{
    for(i = 1; i <= NF; i++) printf("$%d=%s%s",i, $i, OFS);
    print "";
}
$2==1 && $7 > 20000 {print}

Produce:

$1=gene $2=id $3=variant $4=id $5=tss $6=distance $7=ma $8=samples $9=ma $10=count $11=maf $12=pval $13=nominal $14=slope $15=slope $16=se 
$1=ENSG00000223972.4 $2=1 $3=13417 $4=C $5=CGAGA $6=b37 $7=1548 $8=50 $9=50 $10=0.0766871 $11=0.735446 $12=-0.0468165 $13=0.138428 
$1=ENSG00000223972.4 $2=1 $3=17559 $4=G $5=C $6=b37 $7=5690 $8=7 $9=7 $10=0.00964187 $11=0.39765 $12=-0.287573 $13=0.339508 
$1=ENSG00000223972.4 $2=1 $3=54421 $4=A $5=G $6=b37 $7=42552 $8=28 $9=28 $10=0.039548 $11=0.680357 $12=0.0741142 $13=0.179725 
ENSG00000223972.4   1_54421_A_G_b37 42552   28  28  0.039548    0.680357    0.0741142   0.179725
$1=ENSG00000223972.4 $2=1 $3=54490 $4=G $5=A $6=b37 $7=42621 $8=112 $9=120 $10=0.176471 $11=0.00824733 $12=0.247533 $13=0.093081 
ENSG00000223972.4   1_54490_G_A_b37 42621   112 120 0.176471    0.00824733  0.247533    0.093081
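
To run the debug script, save it to a file and pass it to awk with -f (the file name debug.awk is just an example):

awk -f debug.awk input.txt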

Upvotes: 0

karakfa

Reputation: 67547

one way of doing this is with split

$ awk '{split($2,f2,"_")} 
   f2[1]==1 && f2[2]>20000 {gsub("_"," ",$2); print > (FILENAME".output")}' file

however, if you provide data through stdin, awk won't have a FILENAME to write to. You need to pass it to the script as a variable, perhaps, as in the sketch below.
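
A minimal sketch of that idea, passing the output file name in with -v (the variable name out is illustrative):

zcat InputData.txt.gz |
  awk -v out="OutputData.txt" '{split($2,f2,"_")}
    f2[1]==1 && f2[2]>20000 {gsub("_"," ",$2); print > out}'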

Upvotes: 1
