Reputation: 351
I have a data like this:
head out_sign.txt
chr pos gene_id p.val.LCL p.val.Retina
chr1 756164 ENSG00000237094 0 7.15853e-05
chr1 775930 ENSG00000237094 0 9.72251e-05
chr1 814583 ENSG00000237094 0 1.88553e-05
chr1 815963 ENSG00000237094 0 3.85837e-06
chr6 130235069 ENSG00000164484 5.43098e-06 0
chr17 45848600 ENSG00000236234 9.005e-06 0
chr11 34862454 ENSG00000110435 1.81973e-06 0
chr17 45979467 ENSG00000236234 1.34927e-05 0
chr5 81234109 ENSG00000248794 4.36795e-06 0
so sometimes p.val.LCL is ==0 and sometimes p.val.Retina==0.
Every times one of them is equal to zero I would like to look up the corresponding combination of: chr, pos, gene_id in these two files:
for p.val.LCL==0 look up in file called: "lcls.txt" for the combination of chr, pos, gene_id file and replace the zero value with the one found in "lcls.txt"
head lcls.txt
chr pos gene_id pval_nominal
chr1 756164 ENSG00000237094 0.1
chr1 775930 ENSG00000237094 0.4
chr1 814583 ENSG00000237094 0.6
chr1 815963 ENSG00000237094 0.7
for
p.val.Retina==0 look up the combination of chr, pos, gene_id in "ret.txt" file and replace the zero value with the one found in "ret.txt"
head ret.txt
chr pos gene_id pval_ret
chr6 130235069 ENSG00000164484 5.43098e-06
chr17 45848600 ENSG00000236234 9.005e-06
chr11 34862454 ENSG00000110435 1.81973e-06
chr17 45979467 ENSG00000236234 1.34927e-05
chr5 81234109 ENSG00000248794 4.36795e-06
So for this given example the SOLUTION would be:
chr pos gene_id p.val.LCL p.val.Retina
chr1 756164 ENSG00000237094 0.1 7.15853e-05
chr1 775930 ENSG00000237094 0.4 9.72251e-05
chr1 814583 ENSG00000237094 0.6 1.88553e-05
chr1 815963 ENSG00000237094 0.7 3.85837e-06
chr6 130235069 ENSG00000164484 5.43098e-06 5.43098e-06
chr17 45848600 ENSG00000236234 9.005e-06 9.005e-06
chr11 34862454 ENSG00000110435 1.81973e-06 1.81973e-06
chr17 45979467 ENSG00000236234 1.34927e-05 1.34927e-05
chr5 81234109 ENSG00000248794 4.36795e-06 4.36795e-06
The Solution file would have the same number of lines like out_sign.txt.
I should mention that these files are fairly large, number of lines in each file is:
107371529 ret.txt
166941636 lcls.txt
2250213 out_sign.txt
Probably awk solution would be the best because of the size of the data. I tried doing this in R but I am getting memory related error. awk should be able to handle it.
Upvotes: 1
Views: 87
Reputation: 14493
Possible to implement using awk script
awk '
t == "lcls" { if ( FNR > 1 ) lcls[$1, $2, $3] = $4 ; next }
t == "ret" { if ( FNR > 1 ) ret[$1, $2, $3] = $4 ; next }
# Lookup for main file
{
if ( $4 == 0 ) { v = lcls[$1, $2, $3] ; if ( v ) $4 = v ; }
if ( $5 == 0 ) { v = ret[$1, $2, $3] ; if ( v ) $5 = v ; }
print
}
' t=ret ret.txt t=lcls lcls.txt t= out_sign.txt
It load the lookup files into awk tables, and then perform the replacement on the 'data' file. The 't' variable indicate the input 'type'.
Upvotes: 1
Reputation: 207758
I think the approach by @dash-o is maybe more elegant. However, just for fun, here is another method in the vein of a Schwartzian transform, to distinguish the three input files into awk
, drop the header line and then prefix each line with L
, R
or D
.
So, if you use:
sed '1d; s/^/L /' lcls.txt
sed '1d; s/^/R /' ret.txt
sed '1d; s/^/D /' out_sign.txt
you'll get this:
L chr1 756164 ENSG00000237094 0.1
L chr1 775930 ENSG00000237094 0.4
L chr1 814583 ENSG00000237094 0.6
L chr1 815963 ENSG00000237094 0.7
R chr6 130235069 ENSG00000164484 5.43098e-06
R chr17 45848600 ENSG00000236234 9.005e-06
R chr11 34862454 ENSG00000110435 1.81973e-06
R chr17 45979467 ENSG00000236234 1.34927e-05
R chr5 81234109 ENSG00000248794 4.36795e-06
D chr1 756164 ENSG00000237094 0 7.15853e-05
D chr1 775930 ENSG00000237094 0 9.72251e-05
D chr1 814583 ENSG00000237094 0 1.88553e-05
D chr1 815963 ENSG00000237094 0 3.85837e-06
D chr6 130235069 ENSG00000164484 5.43098e-06 0
D chr17 45848600 ENSG00000236234 9.005e-06 0
D chr11 34862454 ENSG00000110435 1.81973e-06 0
D chr17 45979467 ENSG00000236234 1.34927e-05 0
D chr5 81234109 ENSG00000248794 4.36795e-06 0
Now you can awk
like this:
{ sed '1d;s/^/L /' lcls.txt; sed '1d;s/^/R /' ret.txt; sed '1d;s/^/D /' out_sign.txt; } |
awk '
/^L/ {lcls[$2,$3,$4]=$5; next} # build "lcls" table
/^R/ {ret[$2,$3,$4]=$5; next} # build "ret" table
/^D/ {... per dash-o lines starting "if" but field numbers incremented ... }'
Or, if you like bash
process-substitution, this is equivalent:
awk '...' <(sed '1d; s/^/L /' lcls.txt) <(sed '1d; s/^/R /' ret.txt) <(sed '1d; s/^/D /' out_sign.txt)
Upvotes: 0