The Nightman

Reputation: 5759

Using Awk and match()

I have a sequencing file to analyze that contains many tab-separated lines like the following:

chr12   3356475 .   C   A   76.508  .   AB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=9.52472;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=8.76405;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=111;QR=0;RO=0;RPP=9.52472;RPPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL    1/1:3:0:0:3:111:-10,-0.90309,0

I am trying to use awk to match particular regions to their DP value. This is how I'm trying it:

awk '$2 == 33564.. { match(DP=) }' file.txt | head

Neither the matching nor the wildcards seem to work.

Ideally this code would output 3 because that is what DP equals.

Upvotes: 1

Views: 88

Answers (2)

peak

Reputation: 116650

Having worked with genomic data, I believe that the following will be more robust than the previously posted solution. The main difference is that the key-value pairs are treated as such, without any assumption about their ordering or position. The minor difference is the caret ("^") in the regex, which anchors the match to the start of the second field:

awk -F'\t' '
  $2 ~ /^33564../ {
    n = split($8, a, ";")            # a[1..n] holds the key=value pairs
    for (i = 1; i <= n; i++) {
      split(a[i], b, "=")            # b[1] is the key, b[2] the value
      if (b[1] == "DP") print $2, b[2]
    }
  }' file.txt

If this script is to be used more than once, then it would be better to abstract the lookup functionality, e.g. like so:

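awk -F'\t' '
  # i, n, a and b are local variables (extra parameters after the gap)
  function lookup(key, string,   i, n, a, b) {
    n = split(string, a, ";")
    for (i = 1; i <= n; i++) {
      split(a[i], b, "=")
      if (b[1] == key) return b[2]
    }
  }
  $2 ~ /^33564../ {
    val = lookup("DP", $8)
    # compare against "" so that a value of 0 is still printed
    if (val != "") print $2, val
  }' file.txt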
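
To check the claim about ordering, here is a small sketch that runs the same split-based lookup on two test lines. The first is a shortened copy of the line from the question; the second is made up for illustration, with DP moved to the front and set to 0. The file name sample.vcf and the helper name info_value are assumptions, not part of the original script.

```shell
# Two test lines, joined with real tabs. The second one is invented test input.
printf 'chr12\t3356475\t.\tC\tA\t76.508\t.\tAB=0;AO=3;DP=3;TYPE=snp\n'  > sample.vcf
printf 'chr12\t3356499\t.\tG\tT\t50\t.\tDP=0;AB=0;TYPE=snp\n'          >> sample.vcf

# Hypothetical helper: info_value KEY FILE
# Prints the position and the value of KEY for every line whose
# second field starts with 33564 followed by two more characters.
info_value() {
  awk -F'\t' -v key="$1" '
    $2 ~ /^33564../ {
      n = split($8, a, ";")
      for (i = 1; i <= n; i++) {
        split(a[i], b, "=")
        if (b[1] == key) print $2, b[2]
      }
    }' "$2"
}

info_value DP sample.vcf
```

Both lines are reported, including the one where DP is 0 and appears first in the eighth column.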

Upvotes: 1

hek2mgl

Reputation: 157947

You can use either ; or tab as the field delimiter. Doing so, you can access the number in $2 and the DP= field in $15 (counted on the sample line in the question, which has seven tab-separated columns before the key=value list):

awk -F'[;\t]' '$2 ~ /33564../{sub(/DP=/,"",$15);print $15}' file.txt

The sub function is used to remove DP= from $15, which leaves only the value. Note that this relies on DP being the eighth key in the list on every line.

Btw, if you also add = to the set of field delimiters, the value of DP will be in field 23:

awk -F'[;\t=]' '$2 ~ /33564../{print $23}' file.txt
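
Since the question asks about match(), here is a minimal sketch of how it could be used instead of counting fields. match(string, regex) sets RSTART and RLENGTH, and substr() then cuts out the value. It assumes the key=value list is in the eighth tab-separated column, as in the sample line; the file name sample.vcf and the function name dp_by_match are made up for the example, and the sample line is shortened.

```shell
# One test line taken (shortened) from the question, joined with real tabs.
printf 'chr12\t3356475\t.\tC\tA\t76.508\t.\tAB=0;AO=3;CIGAR=1X;DP=3;DPB=3;DPRA=0;TYPE=snp\tGT:DP\t1/1:3\n' > sample.vcf

# Hypothetical helper: dp_by_match FILE
dp_by_match() {
  awk -F'\t' '
    $2 ~ /^33564../ {
      info = ";" $8                    # prefix so that every key is preceded by ";"
      if (match(info, /;DP=[^;]*/))    # ";DP=" cannot match inside DPB= or DPRA=
        print substr(info, RSTART + 4, RLENGTH - 4)   # skip the 4 characters ";DP="
    }' "$1"
}

dp_by_match sample.vcf
```

This finds DP wherever it appears in the column, so it does not break when the number or order of keys differs between lines.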

Upvotes: 2
