Reputation: 288
Goal: I would like to replace every instance of "-" with "." between the string "antisense" and "::" in my file.
Representative sample of my file:
>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense_tRNA-Arg_tRNA-Gly_tRNA-Asp_tRNA-Val::NC_009089.1:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG
Desired output:
>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense_tRNA.Leu_tRNA.Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val::NC_009089.1:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG
You'll notice above that all lines should be printed to output, but only the last two lines in this example beginning with ">" appear different (where each "-" is replaced with "." between the string "antisense" and "::"). This is because the first five lines beginning with ">" do not have strings containing "-" in the described field.
I have been trying to accomplish this with awk, but I am open to sed solutions as well if they are easier. My general approach was to use the string "antisense" or "antisense_" as one delimiter and either ":" or "::" as the other delimiter, thus making column 2 ($2) the field targeted for character replacement.
Here are some of my attempts and their erroneous outputs:
$ awk -F'>antisense_|:' '/^ *>antisense_/ {gsub("-", ".", $2); print}' file
>antisense_tadA::NC_009089.1:19643-19848(-)
>antisense_recR::NC_009089.1:22931-23105(+)
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
tRNA.Leu_tRNA.Met NC_009089.1 30389-30422(+)
tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val NC_009089.1 30559-31181(-)
$ awk 'BEGIN {OFS=FS="/antisense/|:"} {gsub("-", ".", $2)} 1' file
>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense_tRNA-Arg_tRNA-Gly_tRNA-Asp_tRNA-Val::NC_009089.1:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG
$ awk 'BEGIN {OFS=FS="antisense|:"} {gsub("-", ".", $2)} 1' file
>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense|:_tRNA.Leu_tRNA.Metantisense|:antisense|:NC_009089.1antisense|:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense|:_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Valantisense|:antisense|:NC_009089.1antisense|:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG
I understand why/how each of my attempts fails, but I'm not sure how to properly specify more than one multi-character delimiter with awk while ensuring the defined delimiters remain unchanged between the input and output files (no replacement). Any ideas or suggestions?
Upvotes: 1
Views: 883
Reputation: 84579
While awk is good for a great many things, your current problem is better handled with sed. You can use the normal address-plus-substitution form, sed '/match/s/find/replace/', where you have:
sed ':a /antisense/s/\(^[^:]*\)-\([^:]*\):/\1.\2:/;ta'
Here :a is a label, and the t command branches back to it whenever the preceding substitution succeeded. Each pass replaces only one '-' (the last one before the first ':'), so the loop repeats until every '-' before the first ':' in lines containing "antisense" has been replaced by '.'.
Add the -i option to edit a file in place with sed -i ... file or use sed -i.bak ... file to create a copy of the original in file.bak before replacement.
Example
$ echo "antisense_tRNA-Arg_tRNA-Gly_tRNA-Asp_tRNA-Val::NC_009089.1:30559-31181(-)" |
sed ':a /antisense/s/\(^[^:]*\)-\([^:]*\):/\1.\2:/;ta'
antisense_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val::NC_009089.1:30559-31181(-)
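The same loop can also be written with the label and the branch in separate -e expressions, a form that does not depend on how a given sed parses text after a label. The following is only a sketch: the function name dash_to_dot is hypothetical, and the ^> anchor in the address is an added assumption (not part of the command above) so that only header lines starting with ">antisense" are touched.

```shell
# dash_to_dot is a hypothetical helper name used only for this sketch.
# It reads FASTA text on stdin and writes the edited text to stdout.
dash_to_dot() {
  # :a  defines the loop label.
  # s   replaces the last '-' found before the first ':' on ">antisense" headers.
  # ta  jumps back to :a if that substitution succeeded, so the next '-' is handled.
  sed -e ':a' \
      -e '/^>antisense/s/^\([^:]*\)-\([^:]*\):/\1.\2:/' \
      -e 'ta'
}

printf '%s\n' \
  '>-::NC_009089.1:17609-17804(+)' \
  '>antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)' \
  'TTTACATAGAGTTAACACTCTAAAAACTGCACA' |
  dash_to_dot
# prints:
# >-::NC_009089.1:17609-17804(+)
# >antisense_tRNA.Leu_tRNA.Met::NC_009089.1:30389-30422(+)
# TTTACATAGAGTTAACACTCTAAAAACTGCACA
```

Because the pattern is anchored at the start of the line and stops at the first ':', the "-" in the coordinate range and in the strand marker "(-)" is never rewritten.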
Upvotes: 2
Reputation: 785651
For the example data shown this awk should work for you:
awk 'match($0, /antisense.*::/) {s = substr($0, RSTART, RLENGTH);
gsub(/-/, ".", s); $0 = substr($0, 1, RSTART-1) s substr($0, RSTART + RLENGTH)} 1' file
>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense_tRNA.Leu_tRNA.Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val::NC_009089.1:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG
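On the delimiter part of the question: the attempts fail because FS is treated as a regular expression while OFS is always a literal string, so a regex such as "antisense|:" is printed verbatim when awk rebuilds the line. One possible workaround, sketched below, is to split on the single literal string "::" and reuse it as OFS. The function name fix_headers is hypothetical, and the sketch assumes the name field is everything before the first "::" on lines starting with ">antisense".

```shell
# fix_headers is a hypothetical helper name used only for this sketch.
# It reads FASTA text on stdin and writes the edited text to stdout.
fix_headers() {
  awk 'BEGIN { FS = OFS = "::" }               # same literal string in and out
       /^>antisense/ { gsub(/-/, ".", $1) }    # changing $1 rebuilds $0 using OFS
       { print }'
}

printf '%s\n' \
  '>-::NC_009089.1:20139-20394(-)' \
  '>antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)' \
  'TTTACATAGAGTTAACACTCTAAAAACTGCACA' |
  fix_headers
# prints:
# >-::NC_009089.1:20139-20394(-)
# >antisense_tRNA.Leu_tRNA.Met::NC_009089.1:30389-30422(+)
# TTTACATAGAGTTAACACTCTAAAAACTGCACA
```

Lines that do not match /^>antisense/ are never modified, so awk prints them exactly as read.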
Upvotes: 3