Reputation: 647
I am working with many large gz files like the example below (only the first 5 rows are shown here).
gene_id variant_id tss_distance ma_samples ma_count maf pval_nominal slope slope_se
ENSG00000223972.4 1_13417_C_CGAGA_b37 1548 50 50 0.0766871 0.735446 -0.0468165 0.138428
ENSG00000223972.4 1_17559_G_C_b37 5690 7 7 0.00964187 0.39765 -0.287573 0.339508
ENSG00000223972.4 1_54421_A_G_b37 42552 28 28 0.039548 0.680357 0.0741142 0.179725
ENSG00000223972.4 1_54490_G_A_b37 42621 112 120 0.176471 0.00824733 0.247533 0.093081
Below is the output that I want.
Here, I split the second column by "_" and select rows based on the second and third columns after splitting ($2==1 and $3>20000), then save the result as a txt file. The command below works perfectly.
zcat InputData.txt.gz | awk -F "_" '$1=$1' | awk '{if ($2==1 && $3>20000) {print}}' > OutputData.txt
ENSG00000223972.4 1 54421 A G b37 42552 28 28 0.039548 0.680357 0.0741142 0.179725
ENSG00000223972.4 1 54490 G A b37 42621 112 120 0.176471 0.00824733 0.247533 0.093081
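For reference, the split-and-filter can also be written as a single awk call (a sketch on two rows; the second row is made up so the filter has a non-matching case to drop):

```shell
# Two sample rows (the second is invented so the filter drops something).
printf '%s\n' \
  'ENSG00000223972.4 1_54421_A_G_b37 42552 28 28 0.039548' \
  'ENSG00000223972.4 2_99999_A_G_b37 42552 28 28 0.039548' |
awk '{ split($2, p, "_") }                # split the 2nd column on "_"
     p[1] == 1 && p[2] > 20000 {          # chromosome 1, position > 20000
         gsub("_", " ", $2); print        # flatten the column, then print
     }'
```

A single awk program is also easier to hand to parallel later, since there is only one quoted string to protect.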
But I want to use GNU parallel to speed up the process, since I have many large gz files to work with. However, there seems to be some conflict between GNU parallel and awk, probably in terms of quoting.
I tried defining the awk programs separately as below, but it did not write anything to the output file.
In the command below, I am only running parallel on one input file, but I want to run it on multiple input files and save one output file per input file.
For example,
InputData_1.txt.gz to OutputData_1.txt
InputData_2.txt.gz to OutputData_2.txt
awk1='{ -F "_" "$1=$1" }'
awk2='{if ($2==1 && $3>20000) {print}}'
parallel "zcat {} | awk '$awk1' |awk '$awk2' > OutputData.txt" ::: InputData.txt.gz
Does anyone have any suggestions for this task? Thank you very much.
Following the suggestion from @karakfa, this is one solution:
chr=1
RegionStart=10000
RegionEnd=50000
zcat InputData.txt.gz | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("OutputData.txt")}'
#This also works using parallel
awkbody='{split($2,NewDF,"_")} NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {gsub("_"," ",$2) ; print > ("{}_OutputData.txt")}'
parallel "zcat {} | awk -v chr=$chr -v start=$RegionStart -v end=$RegionEnd '$awkbody' " ::: InputData_*.txt.gz
The output file name for the input file InputData_1.txt.gz
will be InputData_1.txt.gz_OutputData.txt
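As a quick sanity check of the split-based filter (two sample rows from the data above; with these bounds position 17559 passes, while 54421 fails because it is greater than RegionEnd):

```shell
chr=1; RegionStart=10000; RegionEnd=50000

printf '%s\n' \
  'ENSG00000223972.4 1_17559_G_C_b37 5690 7 7' \
  'ENSG00000223972.4 1_54421_A_G_b37 42552 28 28' |
awk -v chr="$chr" -v start="$RegionStart" -v end="$RegionEnd" '
    { split($2, NewDF, "_") }
    NewDF[1]==chr && NewDF[2]>start && NewDF[2]<end {
        gsub("_", " ", $2); print
    }'
```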
Upvotes: 1
Views: 324
Reputation: 33740
https://www.gnu.org/software/parallel/man.html#QUOTING concludes:
Conclusion: To avoid dealing with the quoting problems it may be easier just to write a small script or a function (remember to export -f the function) and have GNU parallel call that.
So:
doit() {
  zcat "$1" |
    awk -F "_" '$1=$1' |
    awk '{if ($2==1 && $3>20000) {print}}'
}
export -f doit
parallel 'doit {} > {=s/In/Out/; s/.gz//=}' ::: InputData*.txt.gz
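The {= s/In/Out/; s/.gz// =} part runs those Perl substitutions on each input name. The same mapping can be sanity-checked in plain bash (a sketch, not part of the command above):

```shell
# Reproduce the name mapping InputData_1.txt.gz -> OutputData_1.txt in bash.
in=InputData_1.txt.gz
out=${in%.gz}       # strip the .gz suffix     -> InputData_1.txt
out=${out/In/Out}   # first "In" becomes "Out" -> OutputData_1.txt
echo "$out"
```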
Upvotes: 2
Reputation: 4900
The simple solution is to combine the filters into a single awk
script; then and only then can parallel help.
Here is a sample solution that scans the whole input.txt
only once (twice the performance):
awk 'BEGIN{FS="[ ]*[_]?"}$2==1 && $7 > 20000 {print}' input.txt
BEGIN{FS="[ ]*[_]?"}
Sets the field separator to any run of spaces followed by an optional "_", so lines are split on both spaces and underscores
$2==1 && $7 > 20000 {print}
Print only lines whose 2nd field == 1 and 7th field > 20000
Sample debug script:
BEGIN{FS="[ ]*[_]?"}
{
for(i = 1; i <= NF; i++) printf("$%d=%s%s",i, $i, OFS);
print "";
}
$2==1 && $7 > 20000 {print}
Produce:
$1=gene $2=id $3=variant $4=id $5=tss $6=distance $7=ma $8=samples $9=ma $10=count $11=maf $12=pval $13=nominal $14=slope $15=slope $16=se
$1=ENSG00000223972.4 $2=1 $3=13417 $4=C $5=CGAGA $6=b37 $7=1548 $8=50 $9=50 $10=0.0766871 $11=0.735446 $12=-0.0468165 $13=0.138428
$1=ENSG00000223972.4 $2=1 $3=17559 $4=G $5=C $6=b37 $7=5690 $8=7 $9=7 $10=0.00964187 $11=0.39765 $12=-0.287573 $13=0.339508
$1=ENSG00000223972.4 $2=1 $3=54421 $4=A $5=G $6=b37 $7=42552 $8=28 $9=28 $10=0.039548 $11=0.680357 $12=0.0741142 $13=0.179725
ENSG00000223972.4 1_54421_A_G_b37 42552 28 28 0.039548 0.680357 0.0741142 0.179725
$1=ENSG00000223972.4 $2=1 $3=54490 $4=G $5=A $6=b37 $7=42621 $8=112 $9=120 $10=0.176471 $11=0.00824733 $12=0.247533 $13=0.093081
ENSG00000223972.4 1_54490_G_A_b37 42621 112 120 0.176471 0.00824733 0.247533 0.093081
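The one-pass filter can be exercised on two of the sample rows. This sketch uses the unambiguous separator [ _]+ (one or more spaces or underscores) instead of [ ]*[_]?, which relies on a field separator that can match the empty string:

```shell
# Row 1 passes ($2==1 and $7==42552 > 20000); row 2 fails ($7==5690).
printf '%s\n' \
  'ENSG00000223972.4 1_54421_A_G_b37 42552 28 28' \
  'ENSG00000223972.4 1_17559_G_C_b37 5690 7 7' |
awk 'BEGIN{FS="[ _]+"} $2==1 && $7>20000 {print}'
```

Note that because no field is reassigned, the line is printed unmodified, underscores included.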
Upvotes: 0
Reputation: 67547
One way of doing this is with split:
$ awk '{split($2,f2,"_")}
f2[1]==1 && f2[2]>20000 {gsub("_"," ",$2); print > (FILENAME".output")}' file
However, if you provide the data through stdin, awk
won't capture a filename to write to (FILENAME is empty). You need to pass the name to the script as a variable instead...
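A sketch of that idea: hand the output name to awk with -v (out is a hypothetical variable name), since FILENAME is empty when the data arrives on stdin:

```shell
# FILENAME is empty for stdin, so pass the output name in via -v.
printf '%s\n' 'ENSG00000223972.4 1_54421_A_G_b37 42552 28 28' |
awk -v out="OutputData_1.txt" '
    { split($2, f2, "_") }
    f2[1]==1 && f2[2]>20000 { gsub("_", " ", $2); print > out }'
cat OutputData_1.txt
```

With parallel, something like -v out="{.}_OutputData.txt" could slot into the question's command so each job names its own output (untested sketch).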
Upvotes: 1