Look up values which are equal to zero in one file and than replace the zero value with found value

Question

I have a data like this:

head out_sign.txt
chr    pos    gene_id    p.val.LCL  p.val.Retina
chr1 756164 ENSG00000237094 0 7.15853e-05
chr1 775930 ENSG00000237094 0 9.72251e-05
chr1 814583 ENSG00000237094 0 1.88553e-05
chr1 815963 ENSG00000237094 0 3.85837e-06
chr6 130235069 ENSG00000164484 5.43098e-06 0
chr17 45848600 ENSG00000236234 9.005e-06 0
chr11 34862454 ENSG00000110435 1.81973e-06 0
chr17 45979467 ENSG00000236234 1.34927e-05 0
chr5 81234109 ENSG00000248794 4.36795e-06 0

so sometimes p.val.LCL is ==0 and sometimes p.val.Retina==0.

Every times one of them is equal to zero I would like to look up the corresponding combination of: chr, pos, gene_id in these two files:

for p.val.LCL==0 look up in file called: "lcls.txt" for the combination of chr, pos, gene_id file and replace the zero value with the one found in "lcls.txt"

head lcls.txt
chr pos gene_id pval_nominal
chr1 756164 ENSG00000237094 0.1
chr1 775930 ENSG00000237094 0.4
chr1 814583 ENSG00000237094 0.6
chr1 815963 ENSG00000237094 0.7

for

p.val.Retina==0 look up the combination of chr, pos, gene_id in "ret.txt" file and replace the zero value with the one found in "ret.txt"

head ret.txt
chr pos gene_id pval_ret
chr6 130235069 ENSG00000164484 5.43098e-06
chr17 45848600 ENSG00000236234 9.005e-06 
chr11 34862454 ENSG00000110435 1.81973e-06
chr17 45979467 ENSG00000236234 1.34927e-05
chr5 81234109 ENSG00000248794 4.36795e-06

So for this given example the SOLUTION would be:

chr    pos      gene_id         p.val.LCL   p.val.Retina
chr1  756164    ENSG00000237094 0.1         7.15853e-05
chr1  775930    ENSG00000237094 0.4         9.72251e-05
chr1  814583    ENSG00000237094 0.6         1.88553e-05
chr1  815963    ENSG00000237094 0.7         3.85837e-06
chr6  130235069 ENSG00000164484 5.43098e-06 5.43098e-06
chr17 45848600  ENSG00000236234 9.005e-06   9.005e-06
chr11 34862454  ENSG00000110435 1.81973e-06 1.81973e-06
chr17 45979467  ENSG00000236234 1.34927e-05 1.34927e-05
chr5  81234109  ENSG00000248794 4.36795e-06 4.36795e-06

The Solution file would have the same number of lines like out_sign.txt.

I should mention that these files are fairly large, number of lines in each file is:

107371529 ret.txt
166941636 lcls.txt
2250213 out_sign.txt

Probably awk solution would be the best because of the size of the data. I tried doing this in R but I am getting memory related error. awk should be able to handle it.

dash-o · Accepted Answer

Possible to implement using awk script

awk '
t == "lcls" { if ( FNR > 1 ) lcls[$1, $2, $3] = $4 ; next }
t == "ret" { if ( FNR > 1 ) ret[$1, $2, $3] = $4 ; next }

        # Lookup for main file
{
        if ( $4 == 0 ) { v = lcls[$1, $2, $3] ; if ( v ) $4 = v ; }
        if ( $5 == 0 ) { v = ret[$1, $2, $3] ; if ( v ) $5 = v ; }
        print
}
' t=ret ret.txt t=lcls lcls.txt t= out_sign.txt

It load the lookup files into awk tables, and then perform the replacement on the 'data' file. The 't' variable indicate the input 'type'.

Look up values which are equal to zero in one file and than replace the zero value with found value

Answers (2)

Related Questions