justaguy

Reputation: 3022

awk to extract and print first occurrence of patterns

I am trying to use awk to extract and print the first occurrence of NM_ and the portion of the NP_ entry that starts with p. A : is printed instead of the "|" for each. The input file is tab-delimited, but the output does not need to be. The code below does execute, but it prints all the lines in the file, not just the patterns. There may be multiple NM or NP entries in my actual data of over 5000 lines, but only the first occurrence of each should be extracted and printed. I am still a little unclear on the RSTART and RLENGTH concepts, but using line 1 as an example from the input:

The NM variable would be NM_020469.2

The NP variable would be :p.Gly268Arg

I have included comments as well. Thank you :).

input

Input Variant   HGVS description(s) Errors and warnings
rs41302905  NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg
rs8176745   NC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=

desired output

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

awk

awk -F'[\t|]' 'NR>1{ # set FS to tab or `|` to split each line, and skip the header line
              r=$1; nm=np="";  # store $1 in r and reset nm and np to empty strings
              for(i=2;i<=NF;i++) { # loop over the fields, starting from field 2
                  if ($i~/^NM_/) nm=$i;  # store the NM_ field in nm
                  else if ($i~/^NP_/) np=substr($i,index($i,":")); # store the part of the NP_ field from : onward (including the :) in np
                  if (nm && np) { print r,nm np; break }  # once both are set, print the desired output and stop
              }
          }' input

Upvotes: 3

Views: 1353

Answers (5)

Claes Wikner

Reputation: 1517

Another alternative awk proposal.

awk 'NR>1{sub(/\|/," ")sub(/\|NP_065202.2/,"");print $1,$3,$4}' file

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
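The command above hard-codes the NP_065202.2 identifier from the sample, so it would presumably leave other NP_ identifiers in place. A minimal sketch of the same two-substitution idea with a generic pattern follows; the function name strip_np and the identifiers in the sample line are hypothetical, not taken from the question's data.

```shell
# strip_np: hypothetical helper name, reads the tab-delimited file on stdin
strip_np() {
    awk 'NR>1{
        sub(/\|/, " ")           # first "|" becomes a space, so the NM_ part is field 3
        sub(/\|NP_[^:]*/, "")    # drop "|NP_<any id>" up to (not including) the ":"
        print $1, $3
    }'
}

# Made-up sample line, for illustration only
printf 'Input Variant\tHGVS description(s)\nrsX\tNC_000001.1:g.5A>G|NM_000002.3:c.5A>G|NP_000003.4:p.Lys2Arg\n' | strip_np
```

Like the original, this assumes exactly one NM_ entry followed by one NP_ entry per line.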

Upvotes: 1

Ed Morton

Reputation: 203209

Given your posted sample input, this is all you need to produce your desired output:

$ awk -F'[\t|]+' 'NR>1{sub(/[^:]+/,"",$4); print $1, $3 $4}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

If that's not all you need then provide more truly representative input/output.

Upvotes: 1

RomanPerekhrest

Reputation: 92854

Short GNU awk solution (with match function):

awk 'match($0,/(NM_[^|]+).*NP_[^:]+([^[:space:]|]+)/,a){ print $1,a[1] a[2] }' input

The output:

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

Upvotes: 1

RomanPerekhrest

Reputation: 92854

Awk solution:

awk -F'[\t|]' 'NR>1{
                  r=$1; nm=np="";
                  for(i=2;i<=NF;i++) {
                      if ($i~/^NM_/) nm=$i;
                      else if ($i~/^NP_/) np=substr($i,index($i,":"));
                      if (nm && np) { print r,nm np; break } 
                  }
              }' input

  • NR>1 - start processing from the 2nd record

  • r=$1; nm=np="" - initialization of the needed variables

  • for(i=2;i<=NF;i++) - iterating through the fields (starting from the 2nd)

  • if ($i~/^NM_/) nm=$i - capturing the NM_... item into variable nm

  • else if ($i~/^NP_/) np=substr($i,index($i,":")) - capturing the NP_... item into variable np (starting from : till the end)

  • if (nm && np) { print r,nm np; break } - if both items have been captured, print them and break the loop to avoid further processing


The output:

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
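One caveat for the real data: if a line holds a second NM_ field before the first NP_ field, the loop above overwrites nm with the later one. A minimal sketch that keeps only the first occurrence of each is shown here; the function name first_nm_np and the sample line with two NM_ and two NP_ entries are made up for illustration.

```shell
# first_nm_np: hypothetical helper name, reads the tab-delimited file on stdin
first_nm_np() {
    awk -F'[\t|]' 'NR>1{
        r=$1; nm=np=""
        for(i=2;i<=NF;i++) {
            if (nm=="" && $i~/^NM_/) nm=$i                             # only the first NM_ field is kept
            else if (np=="" && $i~/^NP_/) np=substr($i,index($i,":"))  # only the first NP_ field, from ":" on
            if (nm!="" && np!="") { print r,nm np; break }
        }
    }'
}

# Made-up line with two NM_ and two NP_ entries (not from the posted input)
printf 'Input Variant\tHGVS description(s)\nrs1\tNC_000009.11:g.1C>T|NM_020469.2:c.802G>A|NM_999999.1:c.1A>G|NP_065202.2:p.Gly268Arg|NP_999999.1:p.Ala1Val\n' | first_nm_np
```

On that made-up line this prints rs1 NM_020469.2:c.802G>A:p.Gly268Arg, whereas the unguarded loop would pair the second NM_ entry with the first NP_ entry.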

Upvotes: 1

RavinderSingh13

Reputation: 133428

Could you please try the following and let me know if this helps.

awk '{
match($0,/NM_[^|]*/);
nm=substr($0,RSTART,RLENGTH);
match($0,/NP_([^|]|[^$])*/);
np=substr($0,RSTART,RLENGTH);
split(np, a,":");
  if(nm && np){
    print $1,nm ":" a[2]
}
}
'   Input_file

Output will be as follows.

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

PS: Your sample Input_file does not appear to contain tabs, so if your actual Input_file is tab-delimited you could add -F"\t" after awk, and if you want the output to be tab-delimited too, add OFS="\t" before Input_file.
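Since RSTART and RLENGTH were mentioned as unclear: match() sets RSTART to the 1-based position where the match begins and RLENGTH to its length in characters (0 and -1 when nothing matches), which is why substr($0,RSTART,RLENGTH) returns exactly the matched text. A small sketch that prints all three for the NM_ pattern, using the second column of line 1 from the posted input; the function name show_match is only for illustration.

```shell
# show_match: hypothetical helper name, prints RSTART, RLENGTH and the matched text
show_match() {
    awk '{
        if (match($0, /NM_[^|]*/))                            # match() returns RSTART, which is 0 on no match
            print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    }'
}

printf 'NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg\n' | show_match
```

For that string the NM_ entry starts at character 29 and is 20 characters long, so the line printed is 29 20 NM_020469.2:c.802G>A.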

Upvotes: 1
