Reputation: 3022
In the awk below I am trying to append :p.= to each entry in $7, but only if the field contains the pattern /NM/. The script works when there is only one NM entry in $7, as in line 2. However, when $7 holds multiple NM entries, as in line 3, the :p.= is added only after the last one. A ; separates multiple NM entries within the field. I added comments, but I am not sure what I am missing. Thank you :).
input tab-delimited
R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene
1 chr1 948846 948846 - A dist=1 ISG15
2 chr1 948870 948870 C G NM_005101:c.-84C>G ISG15
3 chr1 948921 948921 T C NM_005101:c.-33T>C;NM_005101:c.-84C>G ISG15
4 chr1 949654 949654 A G . ISG15
awk
awk '
BEGIN { FS=OFS="\t" }  # define FS and OFS as tab and start processing
$7 ~ /NM/ {            # look for pattern NM in $7
    # split $7 by ";" and cycle through them
    i=split($7,NM,";")
    for (n=1; n<=i; n++) {
        sub("$", ":p.=", $7)  # add :p.= to the end of each $7 before the ;
    }                         # close block
}1' input                     # define input file
current output tab-delimited
R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene
1 chr1 948846 948846 - A dist=1 ISG15
2 chr1 948870 948870 C G NM_005101:c.-84C>G:p.= ISG15
3 chr1 948921 948921 T C NM_005101:c.-33T>C;NM_005101:c.-84C>G:p.=p.= ISG15
4 chr1 949654 949654 A G . ISG15
desired output tab-delimited
R_Index Chr Start End Ref Alt Detail.refGene Gene.refGene
1 chr1 948846 948846 - A dist=1 ISG15
2 chr1 948870 948870 C G NM_005101:c.-84C>G:p.= ISG15
3 chr1 948921 948921 T C NM_005101:c.-33T>C:p.=;NM_005101:c.-84C>G:p.= ISG15
4 chr1 949654 949654 A G . ISG15
Upvotes: 0
Views: 64
Reputation: 203129
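Your loop calls sub() on $7 as a whole on every iteration, so each pass appends the suffix to the end of the entire field instead of to the individual entries that split() stored in NM. Modify each array element and rebuild the field from them. Replace this:
    i=split($7,NM,";")
    for (n=1; n<=i; n++) {
        sub("$", ":p.=", $7)  # add :p.= to the end of each $7 before the ;
    }
with this:
    out=""
    i=split($7,NM,/;/)
    for (n=1; n<=i; n++) {
        sub(/$/, ":p.=", NM[n])  # add :p.= to the end of each NM[n]
        out = (out=="" ? "" : out";") NM[n]
    }
    $7 = out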
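For reference, here is a minimal end-to-end sketch of the per-entry approach, run against row 3 of the sample input. The shell function name `add_p` is hypothetical and exists only so the script can be piped into. The sketch indexes the array with the loop counter `n` and appends `:p.=`, the suffix shown in the desired output.

```shell
#!/bin/sh
# add_p is a hypothetical wrapper name, not part of the original post.
# It appends ":p.=" to every ";"-separated entry of field 7
# whenever that field contains "NM".
add_p() {
    awk '
    BEGIN { FS = OFS = "\t" }            # tab-delimited in and out
    $7 ~ /NM/ {                          # only touch rows whose $7 mentions NM
        out = ""
        i = split($7, NM, /;/)           # NM[1..i] holds the individual entries
        for (n = 1; n <= i; n++) {
            NM[n] = NM[n] ":p.="         # change the array element, not $7
            out = (out == "" ? "" : out ";") NM[n]
        }
        $7 = out                         # assigning $7 rebuilds the record with OFS
    }
    1'                                   # print every record, changed or not
}

# Row 3 from the sample input: two NM entries separated by ";".
printf '3\tchr1\t948921\t948921\tT\tC\tNM_005101:c.-33T>C;NM_005101:c.-84C>G\tISG15\n' | add_p
```

Rows whose $7 does not contain NM, such as `dist=1` or `.`, pass through unchanged because only the trailing `1` pattern applies to them.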
Upvotes: 2
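If a field could ever mix NM entries with other kinds of entries, the same loop can test each entry before appending. This is only a sketch under that assumption: the sample data never shows a mixed field, so the input row below is made up for illustration, and the function name `add_p_selective` and the array name `parts` are hypothetical.

```shell
#!/bin/sh
# add_p_selective is a hypothetical name. It appends ":p.=" only to the
# ";"-separated entries of field 7 that themselves contain "NM".
add_p_selective() {
    awk '
    BEGIN { FS = OFS = "\t" }
    $7 ~ /NM/ {
        out = ""
        count = split($7, parts, /;/)
        for (n = 1; n <= count; n++) {
            if (parts[n] ~ /NM/)             # leave non-NM entries untouched
                parts[n] = parts[n] ":p.="
            out = (n == 1 ? "" : out ";") parts[n]
        }
        $7 = out
    }
    1'
}

# Made-up row with one NM entry and one non-NM entry in field 7.
printf '9\tchr1\t1\t1\tT\tC\tNM_005101:c.-33T>C;dist=1\tISG15\n' | add_p_selective
```

On the sample input from the question this behaves the same as the unconditional loop, since every entry in an NM field there is itself an NM entry.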