Reputation: 523
I have a big file with many lines starting like this:
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
In these lines, DR2 values range from 0 to 1, and I would like to extract the lines that contain DR2 values higher than 0.8.
I've tried both sed and awk solutions, but neither seems to work. I've also tried the following:
grep "DR2=[0-1]\.[8-9]*" myfile
Upvotes: 0
Views: 62
Reputation: 203502
Whenever you have tag=value pairs in your data, I find it best to first create an array of those pairings (f[] below), and then you can just access the values by their tags. You didn't provide any input with a DR2 value above 0.8 to test against, so using the data you did provide:
$ awk '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f["DR2"] > 0.01' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
or using variables for the tag and value:
$ awk -v tag='DR2' -v val='0.8' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
$
$ awk -v tag='DR2' -v val='0.01' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
$
$ awk -v tag='AF' -v val='0.4' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
$
$ awk -v tag='AF' -v val='0.5' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
$
or using compound conditions:
$ awk '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]}
(f["AF"] > 0.4) && (f["AF"] < 0.5) && (f["DR2"] >= 0.02)
' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
The point is that whatever comparisons you want to do with the values of those tags become trivial, and you don't need to write more code to isolate and save those tags and their values.
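One caveat with the one-liners above: f[] is never cleared between lines, and a flag without a value (such as IMP) anywhere but last in the field would shift the tag/value pairing. A minimal sketch of a variant that avoids both, assuming the tag=value list is always field 8 and entries are separated by ";". The function name filter_info_gt is hypothetical, not part of the answer above:

```shell
# Hypothetical helper: print lines whose tag (in field 8) exceeds a threshold.
# Usage: filter_info_gt TAG THRESHOLD [FILE...]
filter_info_gt() {
    tag=$1 val=$2
    shift 2
    awk -v tag="$tag" -v val="$val" '
    {
        split("", f)                      # forget tags from the previous line
        n = split($8, kv, ";")            # one element per entry in field 8
        for (i = 1; i <= n; i++) {
            eq = index(kv[i], "=")
            if (eq)                       # tag=value entry
                f[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
            # entries without "=" (flags such as IMP) are skipped
        }
    }
    (tag in f) && (f[tag] + 0 > val + 0)  # force a numeric comparison
    ' "$@"
}
```

It would be called as, for example, filter_info_gt DR2 0.8 myfile. Lines that lack the tag entirely are never printed.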
Upvotes: 1
Reputation: 26481
grep: grep -E 'DR2=([1-9]|0[.][89])' myfile
sed: sed -E -n '/DR2=([1-9]|0[.][89])/p' myfile
awk: awk '/DR2=([1-9]|0[.][89])/' myfile
These three solutions are all based on the same regular expression and all do the same thing (see Ruud HelderMan's answer).
With awk, however, you could do an arithmetic check if your limits are a bit more tricky. Let's say I want the value of DR2 to be between 0.53 and 1.39:
awk '! match($0,/DR2=/) { next }          # skip lines without a DR2 tag
{ val = substr($0,RSTART+RLENGTH)+0 }    # number that follows "DR2="
( 0.53 < val) && ( val < 1.39 )' myfile
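The same range check can be parameterized. This is only a sketch, assuming the tags live in field 8 and are separated by ";". The function name dr2_between is hypothetical. Anchoring the match on the start of the field or a ";" keeps a tag that merely ends in "DR2" from matching:

```shell
# Hypothetical helper: print lines whose DR2 lies strictly between LO and HI.
# Usage: dr2_between LO HI [FILE...]
dr2_between() {
    lo=$1 hi=$2
    shift 2
    awk -v lo="$lo" -v hi="$hi" '
    match($8, /(^|;)DR2=/) {
        # text right after "DR2="; adding 0 keeps only the leading number
        val = substr($8, RSTART + RLENGTH) + 0
        if (lo < val && val < hi) print
    }' "$@"
}
```

For example, dr2_between 0.53 1.39 myfile would apply the limits used above.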
Upvotes: 1
Reputation: 11018
This matches lines with a value greater than or equal to 0.8. If you insist on strictly greater than, then I'll have to add some complexity to prevent 0.8 from matching.
grep 'DR2=\(1\|0\.[89]\)' myfile
The trick is that you need two separate subpatterns: one to match values of 1 and up, and one to match values from 0.8 up to (but not including) 1.
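For the strictly-greater-than case, one possible sketch is shown here. It assumes DR2 is always written as a plain decimal between 0 and 1 (no scientific notation), and the function name dr2_above_0_8 is hypothetical. The extra complexity is that 0.8 must be followed by at least one non-zero digit, so that 0.8 and 0.80 are rejected:

```shell
# Hypothetical helper: keep lines whose DR2 is strictly greater than 0.8.
# Usage: dr2_above_0_8 [FILE...]
dr2_above_0_8() {
    # Accepts 1 (or 1.0...), 0.9..., or 0.8 followed by digits
    # that include at least one non-zero digit.
    grep -E 'DR2=(1|0\.(9|8[0-9]*[1-9]))' "$@"
}
```

It would be called as dr2_above_0_8 myfile.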
Upvotes: 4