Reputation: 5759
I have a sequencing file to analyze that contains many tab-separated lines like the following:
chr12 3356475 . C A 76.508 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=9.52472;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=8.76405;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=111;QR=0;RO=0;RPP=9.52472;RPPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1/1:3:0:0:3:111:-10,-0.90309,0
I am trying to use awk to match particular regions to their DP value. This is how I'm trying it:
awk '$2 == 33564.. { match(DP=) }' file.txt | head
Neither the matching nor the wildcards seem to work.
Ideally this code would output 3, because that is what DP equals.
Upvotes: 1
Views: 88
Reputation: 116650
Having worked with genomic data, I believe that the following will be more robust than the previously posted solution. The main difference is that the key-value pairs in $8 are treated as such, without any assumption about their order. The minor difference is the caret ("^") in the regex, which anchors the match to the start of the position field:
awk -F'\t' '
$2 ~ /^33564../ {
    n=split($8,a,";");
    for(i=1;i<=n;i++) {
        split(a[i],b,"=");
        if (b[1]=="DP") {print $2, b[2]}
    }
}'
If this script is to be used more than once, then it would be better to abstract the lookup functionality, e.g. like so:
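awk -F'\t' '
# Return the value for key in a ";"-separated list of key=value pairs.
function lookup(key, string,   i,n,a,b) {
    n=split(string,a,";");
    for(i=1;i<=n;i++) {
        split(a[i],b,"=");
        if (b[1]==key) {return b[2]}
    }
}
$2 ~ /^33564../ {
    val = lookup("DP", $8);
    # Compare against "" so that a value of 0 is still printed.
    if (val != "") {print $2, val;}
}'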
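As a quick check of the lookup approach, here is a minimal sketch that rebuilds a shortened version of the sample line with printf and runs the function on it. The shortened INFO column, the shell variable names line and out, and the explicit numeric range (used here instead of the regex, so that only positions 3356400 to 3356499 match) are assumptions for illustration and not part of the answer above.

```shell
# Shortened stand-in for the sample record; columns are joined by real tabs.
line=$(printf 'chr12\t3356475\t.\tC\tA\t76.508\t.\tAB=0;AO=3;CIGAR=1X;DP=3;DPB=3;TYPE=snp\tGT:DP\t1/1:3')

out=$(printf '%s\n' "$line" | awk -F'\t' '
# Return the value stored under key in a ";"-separated list of key=value pairs.
# i, n, a and b are locals, declared as extra parameters (the usual awk idiom).
function lookup(key, string,   i, n, a, b) {
    n = split(string, a, ";")
    for (i = 1; i <= n; i++) {
        split(a[i], b, "=")
        if (b[1] == key) return b[2]
    }
    return ""
}
# Numeric comparison on the position column: 3356400 to 3356499 inclusive.
$2 >= 3356400 && $2 <= 3356499 {
    val = lookup("DP", $8)
    if (val != "") print $2, val
}')

echo "$out"
```

A numeric range avoids the ambiguity of the two-dot wildcard, which would also accept longer positions that merely start with 33564 unless the regex is anchored at the end as well.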
Upvotes: 1
Reputation: 157947
You can use either ; or tab as the field delimiter. Doing so, you can access the position in $2 and the DP= field in $15 (counted on the sample line above):
awk -F'[;\t]' '$2 ~ /33564../{sub(/DP=/,"",$15);print $15}' file.txt
The sub function removes DP= from $15, which leaves only the value.
Btw, if you also add = to the set of field delimiters, the value of DP will be in field 23:
awk -F'[;\t=]' '$2 ~ /33564../{print $23}' file.txt
Note that these field numbers hold only as long as every line lists the same keys in the same order before DP.
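Since the question tried to use match(), here is a hedged sketch of how that function could extract DP without relying on a field number. match() sets RSTART and RLENGTH to the location of the matched text, and substr() then cuts it out. The shortened sample record, the variable names line, dp and pair, and the regex anchored at both ends are assumptions made for illustration.

```shell
# Shortened stand-in for the sample record; columns are joined by real tabs.
line=$(printf 'chr12\t3356475\t.\tC\tA\t76.508\t.\tAB=0;AO=3;CIGAR=1X;DP=3;DPB=3;TYPE=snp\tGT:DP\t1/1:3')

dp=$(printf '%s\n' "$line" | awk -F'\t' '
# (^|;) makes sure the key is exactly DP and not the tail of a longer key.
$2 ~ /^33564..$/ && match($8, /(^|;)DP=[^;]*/) {
    pair = substr($8, RSTART, RLENGTH)   # "DP=3" or ";DP=3"
    sub(/^;?DP=/, "", pair)              # strip the key, keep the value
    print pair
}')

echo "$dp"
```

This keeps the one-liner spirit of the delimiter trick while still working if the keys before DP change from line to line.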
Upvotes: 2