awk or other bioinformatics tools to filter vcf

I am trying to filter some lines in a vcf file, here is an example of lines:

1   10505   rs548419688 A   T   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9632;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;E
UR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1   10506   rs568405545 C   G   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9676;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;E
UR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1   10511   rs534229142 G   A   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9869;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;E
UR_AF=0;SAS_AF=0;AA=.|||;VT=SNP
1   10539   rs537182016 C   A   100 PASS    AC=3;AF=0.000599042;AN=5008;NS=2504;DP=9203;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;E
UR_AF=0.001;SAS_AF=0.001;AA=.|||;VT=SNP
1   10542   rs572818783 C   T   100 PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=9007;EAS_AF=0.001;AMR_AF=0;AFR_AF=0;EU
R_AF=0;SAS_AF=0;AA=.|||;VT=SNP

Say I want to extract lines with AMR_AF larger than 0.5, but couldn't figure out how to use Awk regular expressions to do such job. Tried vcftools, but that didn't work.

Upvotes: 0

Answers (3)

zx8754

Reputation: 56219

Using bcftools:

bcftools view -i 'INFO/AMR_AF > 0.5' myFile.vcf

See for more options from bcftools manuals.

Upvotes: 0

tripleee

Reputation: 189789

You can split the line on the field you choose and examine whether the numeric value of the element just after the split is larger than your threshold.

In some more detail, splitting the input yes,foo=2,bar=0.23,baz=1 on ,bar= will yield an array containing yes,foo=2 and 0.23,baz=1. In Awk, if you compare the second element to 0.2, it will simply convert as much as it can from the beginning of the value into a number and then perform a numeric comparison.

Thus

awk '{ split($0, x, /[\t;]AMR_AF=/) } x[2]>0.5' file.vcf

should do what you want. We split the line into x and examine the numeric value of x[2].

The [\t;] in the regex allows for either a tab or a semicolon before the field's name; to be perfectly general, perhaps you should even use (^|[\t;]) to also permit the match to happen at beginning of line.

If you want to parametrize this, maybe try

awk -v field="AMR_AF" -v thres=0.5 '{ split($0, x, "(^|[\t;])" field "=")) } x[2]>thres' file.vcf

Recall that Awk processes the script for each input line from top to bottom, where each script statement has the form

[ condition ] [ { action } ]

As the square brackets indicate, both parts are optional -- if condition is missing, the action is taken unconditionally; if action is missing, it defaults to { print $0 }. So our script will first unconditionally split the line, then conditionally print it if x[2] is larger than the threshold.

GNU Awk can split on a multi-character field separator, so you could use -F '[\t;]AMR_AF=' too.

awk -F '[\t;]AMR_AF=' '$2>0.5' file.vcf

Upvotes: 1

RavinderSingh13

Reputation: 133710

Could you please try following.

awk 'match($0,/AMR_AF=[0-9]+\.[0-9]+|AMR_AF=[0-9]+/) && substr($0,RSTART+7,RLENGTH-7)>0.5'  Input_file

Explanation: Using match function of awk to match regex AMR_AF= digits.digits OR AMR_AF=digits and whenever this regex gets matches on line then it sets RSTART and RLENGTH variables. &&(AND condition) to check if sub-string value of RSTART+7 to till RLENGTH-7 value is greater than 0.5 then print that line.

Upvotes: 1

awk or other bioinformatics tools to filter vcf

Answers (3)

Related Questions