Reputation: 9
I am trying to filter some lines in a vcf file, here is an example of lines:
1 10505 rs548419688 A T 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;E
UR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1 10506 rs568405545 C G 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;E
UR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1 10511 rs534229142 G A 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9869;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;E
UR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1 10539 rs537182016 C A 100 PASS AC=3;AF=0.000599042;AN=5008;NS=2504;DP=9203;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;E
UR_AF=0.001;SAS_AF=0.001;AA=.|||;VT=SNP
1 10542 rs572818783 C T 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9007;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EU
R_AF=0;SAS_AF=0;AA=.|||;VT=SNP
Say I want to extract lines with AMR_AF
larger than 0.5, but couldn't figure out how to use Awk regular expressions to do such job. Tried vcftools
, but that didn't work.
Upvotes: 0
Views: 756
Reputation: 56219
Using bcftools:
bcftools view -i 'INFO/AMR_AF > 0.5' myFile.vcf
See for more options from bcftools manuals.
Upvotes: 0
Reputation: 189789
You can split the line on the field you choose and examine whether the numeric value of the element just after the split is larger than your threshold.
In some more detail, splitting the input yes,foo=2,bar=0.23,baz=1
on ,bar=
will yield an array containing yes,foo=2
and 0.23,baz=1
. In Awk, if you compare the second element to 0.2
, it will simply convert as much as it can from the beginning of the value into a number and then perform a numeric comparison.
Thus
awk '{ split($0, x, /[\t;]AMR_AF=/) } x[2]>0.5' file.vcf
should do what you want. We split the line into x
and examine the numeric value of x[2]
.
The [\t;]
in the regex allows for either a tab or a semicolon before the field's name; to be perfectly general, perhaps you should even use (^|[\t;])
to also permit the match to happen at beginning of line.
If you want to parametrize this, maybe try
awk -v field="AMR_AF" -v thres=0.5 '{ split($0, x, "(^|[\t;])" field "=")) } x[2]>thres' file.vcf
Recall that Awk processes the script for each input line from top to bottom, where each script statement has the form
[ condition ] [ {
action }
]
As the square brackets indicate, both parts are optional -- if condition is missing, the action is taken unconditionally; if action is missing, it defaults to { print $0 }
. So our script will first unconditionally split the line, then conditionally print it if x[2]
is larger than the threshold.
GNU Awk can split on a multi-character field separator, so you could use -F '[\t;]AMR_AF='
too.
awk -F '[\t;]AMR_AF=' '$2>0.5' file.vcf
Upvotes: 1
Reputation: 133710
Could you please try following.
awk 'match($0,/AMR_AF=[0-9]+\.[0-9]+|AMR_AF=[0-9]+/) && substr($0,RSTART+7,RLENGTH-7)>0.5' Input_file
Explanation: Using match
function of awk
to match regex AMR_AF= digits.digits
OR AMR_AF=digits
and whenever this regex gets matches on line then it sets RSTART
and RLENGTH
variables. &&
(AND condition) to check if sub-string value of RSTART+7
to till RLENGTH-7
value is greater than 0.5 then print that line.
Upvotes: 1