justaguy

Reputation: 3022

awk to add text to each pattern in field

In the awk below I am trying to append :p.= to each entry in $7, but only if the entry contains the pattern /NM/. This works when there is only one NM entry in $7, as in line 2. However, when there are multiple NM entries in $7, as in line 3, the :p.= is added only after the last one. A ; separates multiple NM entries in the field. I added comments, but I am not sure what I am missing. Thank you :).

input tab-delimited

R_Index Chr Start   End Ref Alt Detail.refGene  Gene.refGene
1   chr1    948846  948846  -   A   dist=1  ISG15
2   chr1    948870  948870  C   G   NM_005101:c.-84C>G  ISG15
3   chr1    948921  948921  T   C   NM_005101:c.-33T>C;NM_005101:c.-84C>G   ISG15
4   chr1    949654  949654  A   G   .   ISG15

awk

awk '
  BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
  $7 ~ /NM/ {            # look for pattern NM in $7
       # split $7 by ";" and cycle through them
          i=split($7,NM,";")
             for (n=1; n<=i; n++) {
              sub("$", ":p.=", $7)   # add :p.= to the end of each entry in $7 before the ;
    }     # close block
}1' input  # define input file

current output tab-delimited

R_Index Chr Start   End Ref Alt Detail.refGene  Gene.refGene
1   chr1    948846  948846  -   A   dist=1  ISG15
2   chr1    948870  948870  C   G   NM_005101:c.-84C>G:p.=  ISG15
3   chr1    948921  948921  T   C   NM_005101:c.-33T>C;NM_005101:c.-84C>G:p.=p.=    ISG15
4   chr1    949654  949654  A   G   .   ISG15

desired output tab-delimited

R_Index Chr Start   End Ref Alt Detail.refGene  Gene.refGene
1   chr1    948846  948846  -   A   dist=1  ISG15
2   chr1    948870  948870  C   G   NM_005101:c.-84C>G:p.=  ISG15
3   chr1    948921  948921  T   C   NM_005101:c.-33T>C:p.=;NM_005101:c.-84C>G:p.=   ISG15
4   chr1    949654  949654  A   G   .   ISG15

Upvotes: 0

Views: 64

Answers (1)

Ed Morton

Reputation: 203129

Replace this:

      i=split($7,NM,";")
         for (n=1; n<=i; n++) {
          sub("$", ":p.=", $7)   # add :p.= to the end of each entry in $7 before the ;
         }

with this:

      out=""
      i=split($7,NM,/;/)
         for (n=1; n<=i; n++) {
          sub(/$/, ":p.=", NM[n])   # add :p.= to the end of each NM[n]
          out = (out=="" ? "" : out";") NM[n]
         }
      $7 = out
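
The original loop fails because sub("$", ..., $7) always anchors at the end of the whole field, so every iteration appends to the end of $7 and the pieces produced by split() are never changed. For reference, here is a minimal sketch of the complete script with the fix applied. The wrapper function add_p and the array name parts are hypothetical names chosen for illustration, and the per-entry /NM/ check is an assumption based on the question's wording ("only if they have the pattern NM"):

```shell
# Hypothetical wrapper: reads tab-delimited input from files or stdin.
add_p() {
  awk '
    BEGIN { FS = OFS = "\t" }            # tab-delimited in and out
    $7 ~ /NM/ {                          # only touch rows whose $7 mentions NM
      n = split($7, parts, /;/)          # break $7 into its ;-separated entries
      out = ""
      for (i = 1; i <= n; i++) {
        if (parts[i] ~ /NM/)             # assumption: tag only the NM entries
          parts[i] = parts[i] ":p.="
        out = (i == 1 ? "" : out ";") parts[i]   # rejoin with ;
      }
      $7 = out                           # assigning $7 rebuilds $0 using OFS
    }
    1                                    # print every line
  ' "$@"
}

# Example: line 3 of the sample input
printf '3\tchr1\t948921\t948921\tT\tC\tNM_005101:c.-33T>C;NM_005101:c.-84C>G\tISG15\n' | add_p
```

Called as add_p input, this should give the desired output shown in the question, while rows such as "dist=1" or "." pass through unchanged.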

Upvotes: 2
