Reputation: 3022
I am trying to use awk to extract and print the first occurrence of NM_ and the portion after the NP_ starting with p. A : is printed instead of the "|" for each. The input file is tab-delimited, but the output does not need to be. The code below does execute but prints all the lines in the file, not just the patterns. There may be multiple NM or NP entries in my actual data of over 5000 lines; however, only the first occurrence of each should be extracted and printed. I am still a little unclear on the RSTART and RLENGTH concepts but, using line 1 as an example from the input:
The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg
I have included comments as well. Thank you :).
input
Input Variant HGVS description(s) Errors and warnings
rs41302905 NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg
rs8176745 NC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=
desired output
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
awk
awk -F'[\t|]' 'NR>1{ # define FS as tab and `|` to split each line, and skip the header line
r=$1; nm=np=""; # store $1 in r and reset the two variables (nm and np) to empty
for(i=2;i<=NF;i++) { # loop over the fields, starting from field 2
if ($i~/^NM_/) nm=$i; # store the field that starts with NM_ in nm
else if ($i~/^NP_/) np=substr($i,index($i,":")); # store the portion of the NP_ field from : onward (including :) in np
if (nm && np) { print r,nm np; break } # print desired output once both are set, then stop the loop
}
}' input
Upvotes: 3
Views: 1353
Reputation: 1517
Another alternative awk proposal.
awk 'NR>1{sub(/\|/," "); sub(/\|NP_065202\.2/,""); print $1,$3,$4}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
Upvotes: 1
Reputation: 203209
Given your posted sample input, this is all you need to produce your desired output:
$ awk -F'[\t|]+' 'NR>1{sub(/[^:]+/,"",$4); print $1, $3 $4}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
If that's not all you need then provide more truly representative input/output.
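To see why $3 and $4 are the fields of interest here, it can help to print every field with its index. The following is only a sketch: show_fields is a hypothetical helper name, and the printf line assumes the sample is tab-delimited between the rs ID and the HGVS column, as the question states.

```shell
# Hypothetical helper: list each field of every data line with its index,
# splitting on runs of tabs and pipes (the same FS as the command above).
show_fields() {
    awk -F'[\t|]+' 'NR>1{ for (i=1; i<=NF; i++) print i, $i }'
}

# Header line plus the first sample line from the question, tab-delimited (assumed).
printf 'Input Variant\tHGVS description(s)\tErrors and warnings\nrs41302905\tNC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg\n' | show_fields
```

For that sample line this lists the rs ID as field 1, the NC_ entry as field 2, the NM_ entry as field 3 and the NP_ entry as field 4. If the real data has more entries per line, the NM_ and NP_ items may land in other fields, which is why more representative input matters.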
Upvotes: 1
Reputation: 92854
Short GNU awk solution (with the match function):
awk 'match($0,/(NM_[^|]+).*NP_[^:]+([^[:space:]|]+)/,a){ print $1,a[1] a[2] }' input
The output:
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
Upvotes: 1
Reputation: 92854
Awk solution:
awk -F'[\t|]' 'NR>1{
r=$1; nm=np="";
for(i=2;i<=NF;i++) {
if ($i~/^NM_/) nm=$i;
else if ($i~/^NP_/) np=substr($i,index($i,":"));
if (nm && np) { print r,nm np; break }
}
}' input
NR>1 - start processing from the 2nd record
r=$1; nm=np="" - initialization of the needed variables
for(i=2;i<=NF;i++) - iterating through the fields (starting from the 2nd)
if ($i~/^NM_/) nm=$i - capturing the NM_... item into variable nm
else if ($i~/^NP_/) np=substr($i,index($i,":")) - capturing the NP_... item into variable np (starting from : till the end)
if (nm && np) { print r,nm np; break } - if both items have been captured, print them and break the loop to avoid further processing
The output:
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
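If a line can hold several NM_ entries before the first NP_ entry, the loop above would keep the last of them, because nm is overwritten on every NM_ field. A possible variant that keeps only the first of each is sketched below. The function name first_nm_np and the two-transcript sample line are made up for illustration and are not from the original data.

```shell
# Hypothetical variant: only assign nm/np while they are still empty,
# so the first NM_ and the first NP_ field on each line win.
first_nm_np() {
    awk -F'[\t|]' 'NR>1{
        nm=np=""
        for(i=2;i<=NF;i++) {
            if (nm=="" && $i~/^NM_/) nm=$i                             # first NM_ field only
            else if (np=="" && $i~/^NP_/) np=substr($i,index($i,":"))  # first NP_ field, from ":" onward
            if (nm!="" && np!="") { print $1,nm np; break }
        }
    }'
}

# Made-up line with two NM_ and two NP_ entries (assumption, not real data).
printf 'Input\tHGVS\nrs1\tNC_1.1:g.1C>T|NM_1.1:c.1G>A|NM_2.1:c.2G>A|NP_1.1:p.Gly1Arg|NP_2.1:p.Gly2Arg\n' | first_nm_np
```

On the made-up line this prints rs1 NM_1.1:c.1G>A:p.Gly1Arg, whereas the unguarded loop would report NM_2.1:c.2G>A.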
Upvotes: 1
Reputation: 133428
Could you please try the following and let me know if this helps too.
awk '{
match($0,/NM_[^|]*/);
nm=substr($0,RSTART,RLENGTH);
match($0,/NP_[^|]*/);
np=substr($0,RSTART,RLENGTH);
split(np, a,":");
if(nm && np){
print $1,nm ":" a[2]
}
}
' Input_file
Output will be as follows.
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
PS: Your sample Input_file doesn't appear to contain TABs. If your actual Input_file is TAB-delimited, you could add -F"\t" after awk, and if you want the output to be TAB-delimited too, add OFS="\t" before Input_file.
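On the RSTART and RLENGTH question: match(s, regex) returns the position where the first match starts and, as a side effect, sets RSTART to that position and RLENGTH to the length of the matched text, so substr(s, RSTART, RLENGTH) is exactly the matched portion. When nothing matches, RSTART is 0 and RLENGTH is -1. A small sketch, using a hypothetical helper show_match and made-up input strings:

```shell
# Hypothetical helper: report RSTART, RLENGTH and the matched text
# for the first NM_ entry on each input line.
show_match() {
    awk '{
        if (match($0, /NM_[^|]*/)) {
            print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
        } else {
            print RSTART, RLENGTH, "(no match)"
        }
    }'
}

# Made-up strings for illustration (not from the original data).
printf '%s\n' 'xx|NM_1.2:c.3G>A|NP_4.5:p.Gly6Arg' 'no transcript here' | show_match
```

The first line prints 4 13 NM_1.2:c.3G>A (the match starts at character 4 and is 13 characters long); the second prints 0 -1 (no match).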
Upvotes: 1