Reputation: 523

Extract lines with string and variable number pattern

I have a big file with many lines starting like this:

22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS

In these lines, DR2 values range from 0 to 1, and I would like to extract the lines that contain DR2 values higher than 0.8.

I've tried both sed and awk solutions, but neither seems to work. I've tried the following:

grep "DR2=[0-1]\.[8-9]*" myfile

Upvotes: 0

Answers (3)

Ed Morton

Reputation: 203502

Whenever you have tag=value pairs in your data, I find it best to first create an array of those pairings (f[] below); then you can just access the values by their tags. You didn't provide any input with a DR2 above 0.8 to test against, so using the data you did provide:

$ awk '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f["DR2"] > 0.01' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS

or using variables for the tag and value:

$ awk -v tag='DR2' -v val='0.8' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
$
$ awk -v tag='DR2' -v val='0.01' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
$
$ awk -v tag='AF' -v val='0.4' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
$
$ awk -v tag='AF' -v val='0.5' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
$

or using compound conditions:

$ awk '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]}
        (f["AF"] > 0.4) && (f["AF"] < 0.5) && (f["DR2"] >= 0.02)
' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS

The point is that whatever comparisons you want to do with the values of those tags are trivial, and you don't need to write more code to isolate and save those tags and their values.
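Since the question's sample has no DR2 above 0.8, here is a minimal sketch of the same approach run against made-up records (the rs1/rs2/rs3 lines are hypothetical, not from the asker's file). It adds one thing that is not in the commands above: `split("", f)` empties f for every record, on the assumption that some lines may lack a tag and should not inherit the previous line's value.

```shell
# Hypothetical sample data: rs1 is below the threshold, rs2 is above it,
# and rs3 has no DR2 tag at all.
sample=$(mktemp)
cat > "$sample" <<'EOF'
22 16052167 rs1 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
22 16052168 rs2 A C . PASS DR2=0.85;AF=0.1200;IMP GT:DS
22 16052169 rs3 G T . PASS AF=0.3000;IMP GT:DS
EOF

result=$(awk -v tag='DR2' -v val='0.8' '
    { split("", f)                      # clear the tags of the previous record
      split($8, t, /[=;]/)              # t = tag, value, tag, value, ...
      for (i = 1; i in t; i += 2) f[t[i]] = t[i+1] }
    f[tag] > val                        # print the record when the test holds
' "$sample")

echo "$result"
rm -f "$sample"
```

Only the rs2 record should be printed. Without the clearing step, the rs3 record would be compared against the 0.85 left over from rs2.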

Upvotes: 1

kvantour

Reputation: 26481

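grep: grep -E 'DR2=([1-9]|0[.][89])'
sed: sed -n -E '/DR2=([1-9]|0[.][89])/p'
awk: awk '/DR2=([1-9]|0[.][89])/'

These 3 solutions are all based on the same extended regular expression and all do the same thing (see Ruud Helderman's solution). Note that with grep -E, sed -E and awk, the grouping and alternation operators are written unescaped as ( ) and |; the escaped forms \( \) and \| would match literal characters.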

With awk, however, you could do an arithmetic check if your limits are a bit more tricky. Let's say I want the value of DR2 to be between 0.53 and 1.39:

awk '! match($0,/DR2=/) { next }
     { val = substr($0,RSTART+RLENGTH)+0 }
     ( 0.53 < val) && ( val < 1.39 )'
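
The `+0` in the script above works because awk converts a string to a number using only its leading numeric prefix, so the text after the DR2 value is ignored. A minimal sketch of that behaviour, applied to the question's own threshold (the two input lines are made up for illustration):

```shell
# The substring after "DR2=" still carries the rest of the line;
# adding 0 keeps only the leading number.
prefix=$(awk 'BEGIN { s = "0.85;AF=0.1200;IMP GT:DS"; print s + 0 }')
echo "$prefix"

# Same technique with the question's limit: strictly greater than 0.8.
result=$(printf '%s\n' 'a PASS DR2=0.80;AF=0.1' 'b PASS DR2=0.85;AF=0.1' |
    awk '! match($0, /DR2=/) { next }
         { val = substr($0, RSTART + RLENGTH) + 0 }
         val > 0.8')
echo "$result"
```

The first command prints 0.85, and the filter keeps only the second input line because 0.80 is not strictly greater than 0.8.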

Upvotes: 1

Ruud Helderman

Reputation: 11018

This matches lines with a value greater than or equal to 0.8. If you insist on strictly greater than, then I'll have to add some complexity to prevent 0.8 from matching.

grep 'DR2=\(1\|0\.[89]\)' myfile

The trick is that you need two separate subpatterns: one to match values of 1 and greater, and one to match values from 0.8 up to 1.
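
For the strictly-greater-than case, one possible sketch is shown here. It assumes the values never exceed 1 and are written in plain decimal notation (no exponents, no leading sign), and it uses `grep -E` so the grouping needs no backslashes. The third alternative accepts 0.8 only when a non-zero digit follows somewhere after the 8. The input lines are made up for illustration.

```shell
# 1            -> the value 1 (or 1.0...)
# 0\.9         -> anything from 0.9 up to 1
# 0\.8[0-9]*[1-9] -> 0.8 followed by at least one non-zero digit
pattern='DR2=(1|0\.9|0\.8[0-9]*[1-9])'

result=$(printf '%s\n' \
    'DR2=0.79;' 'DR2=0.8;' 'DR2=0.80;' 'DR2=0.801;' 'DR2=0.95;' 'DR2=1;' |
    grep -E "$pattern")
echo "$result"
```

This keeps 0.801, 0.95 and 1, and rejects 0.79, 0.8 and 0.80.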

Upvotes: 4
