Gawain

Reputation: 288

Replacing a character in a single field when using more than one multi-character delimiter

Goal: I would like to replace every instance of "-" with "." between the string "antisense" and "::" in my file.

Representative sample of my file:

>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense_tRNA-Arg_tRNA-Gly_tRNA-Asp_tRNA-Val::NC_009089.1:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG

Desired output:

>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense_tRNA.Leu_tRNA.Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val::NC_009089.1:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG

As shown above, every line should be printed to the output, but only the last two header lines (those beginning with ">") differ from the input: each "-" between the string "antisense" and "::" is replaced with ".". The first five header lines are unchanged because they contain no "-" in that field.

I have been trying to accomplish this with awk, but I am open to sed solutions as well if they are easier. My general approach was to use the string "antisense" or "antisense_" as one delimiter and either ":" or "::" as the other delimiter, thus making column 2 ($2) the field targeted for character replacement.

Here are some of my attempts and their erroneous outputs:

$ awk -F'>antisense_|:' '/^ *>antisense_/ {gsub("-", ".", $2); print}' file
    >antisense_tadA::NC_009089.1:19643-19848(-)
    >antisense_recR::NC_009089.1:22931-23105(+)
    >antisense_16s_rRNA::NC_009089.1:25279-26010(-)
     tRNA.Leu_tRNA.Met  NC_009089.1 30389-30422(+)
     tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val  NC_009089.1 30559-31181(-)

$ awk 'BEGIN {OFS=FS="/antisense/|:"} {gsub("-", ".", $2)} 1' file
    >-::NC_009089.1:17609-17804(+)
    ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
    >antisense_tadA::NC_009089.1:19643-19848(-)
    TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
    >-::NC_009089.1:20139-20394(-)
    GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
    >antisense_recR::NC_009089.1:22931-23105(+)
    TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
    >antisense_16s_rRNA::NC_009089.1:25279-26010(-)
    CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
    >antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)
    TTTACATAGAGTTAACACTCTAAAAACTGCACA
    >antisense_tRNA-Arg_tRNA-Gly_tRNA-Asp_tRNA-Val::NC_009089.1:30559-31181(-)
    CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG

$ awk 'BEGIN {OFS=FS="antisense|:"} {gsub("-", ".", $2)} 1' file
    >-::NC_009089.1:17609-17804(+)
    ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
    >antisense_tadA::NC_009089.1:19643-19848(-)
    TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
    >-::NC_009089.1:20139-20394(-)
    GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
    >antisense_recR::NC_009089.1:22931-23105(+)
    TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
    >antisense_16s_rRNA::NC_009089.1:25279-26010(-)
    CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
    >antisense|:_tRNA.Leu_tRNA.Metantisense|:antisense|:NC_009089.1antisense|:30389-30422(+)
    TTTACATAGAGTTAACACTCTAAAAACTGCACA
    >antisense|:_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Valantisense|:antisense|:NC_009089.1antisense|:30559-31181(-)
    CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG

I understand why and how each of my attempts fails, but I'm not sure how to properly specify more than one multi-character delimiter with awk while ensuring the defined delimiters remain unchanged between the input and output files (no replacement). Any ideas or suggestions?

Upvotes: 1

Views: 883

Answers (2)

David C. Rankin

Reputation: 84579

While awk is good for a great many things, your current problem is better handled with sed. You can use the normal address-plus-substitution form sed '/match/s/find/replace/', where you have:

sed ':a /antisense/s/\(^[^:]*\)-\([^:]*\):/\1.\2:/;ta'

Here :a is a label that the t command branches back to after a successful substitution, so the substitution repeats until every '-' before the first ':' on lines containing "antisense" has been replaced by '.'.

Add the -i option to edit a file in place with sed -i ... file or use sed -i.bak ... file to create a copy of the original in file.bak before replacement.
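As a rough illustration of the in-place form, here is a sketch that runs the same loop against a small scratch file built from two of the sample records. The scratch directory, the file name headers.fa, and the split into separate -e expressions are assumptions made for this demo and are not part of the command above. The split is there because some sed implementations may read everything after :a on the same line as part of the label. The ^ anchor is also moved outside the group, which should behave the same way.

```shell
# Hypothetical scratch directory and file name, used only for this demo.
tmpdir=$(mktemp -d)
cat > "$tmpdir/headers.fa" <<'EOF'
>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
EOF

# Edit in place; the untouched original is kept as headers.fa.bak.
# The loop replaces one "-" per pass and repeats until none is left
# before the first ":" on lines containing "antisense".
sed -i.bak \
    -e ':a' \
    -e '/antisense/s/^\([^:]*\)-\([^:]*\):/\1.\2:/' \
    -e 'ta' \
    "$tmpdir/headers.fa"

cat "$tmpdir/headers.fa"       # edited copy
cat "$tmpdir/headers.fa.bak"   # original, saved before the edit
```

Only the antisense header should change; the ">-::" header and the sequence lines pass through untouched because the /antisense/ address does not select them.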

Example

$ echo "antisense_tRNA-Arg_tRNA-Gly_tRNA-Asp_tRNA-Val::NC_009089.1:30559-31181(-)" | 
  sed ':a /antisense/s/\(^[^:]*\)-\([^:]*\):/\1.\2:/;ta'
antisense_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val::NC_009089.1:30559-31181(-)

Upvotes: 2

anubhava

Reputation: 785651

For the example data shown this awk should work for you:

awk 'match($0, /antisense.*::/) {s = substr($0, RSTART, RLENGTH); 
gsub(/-/, ".", s); $0 = substr($0, 1, RSTART-1) s substr($0, RSTART + RLENGTH)} 1' file

>-::NC_009089.1:17609-17804(+)
ATTAAATAGAAAAAATGAATTTAATATAAAAAATTAAAGAAAATTCTAAAAAAAAAAAGATAAGGTCTTA
>antisense_tadA::NC_009089.1:19643-19848(-)
TTTATAAAAATATTTAGTGTTTTTTTTAAATTAGTTCTAAAATAATTTTTAGATATTCATACAAGAGTGT
>-::NC_009089.1:20139-20394(-)
GCTGTTTTTCTATATATGAATTTTGCTACTTTTACATTATTATTATTAAAATAATCTAATTTAAACTCAT
>antisense_recR::NC_009089.1:22931-23105(+)
TCATCTATAATCGCTTTAGATAAAGCTTCCACATCATTAGTATTCATATTAATAATATGAAAAGCCAATC
>antisense_16s_rRNA::NC_009089.1:25279-26010(-)
CTCTATTTTCCTTTTTATTCTATATTTAAATTTTTTATTTACAAGAATATTTTTAATATAACATATTATG
>antisense_tRNA.Leu_tRNA.Met::NC_009089.1:30389-30422(+)
TTTACATAGAGTTAACACTCTAAAAACTGCACA
>antisense_tRNA.Arg_tRNA.Gly_tRNA.Asp_tRNA.Val::NC_009089.1:30559-31181(-)
CTTAACTTCTGTGTTCGGAATGGGAACAGGTGTATCCTCTTTCCCACCAAGTACCATCAGCGCTAAAGAG
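
For readers less familiar with match(), the one-liner above can be read roughly as follows. This is the same command spread over several lines with comments, wrapped in a shell function whose name (fix_headers) is made up for illustration.

```shell
# fix_headers is a hypothetical wrapper name; it reads files given as
# arguments, or standard input when none are given.
fix_headers() {
  awk '
    # match() returns non-zero when the line contains "antisense ... ::"
    # and sets RSTART (start position) and RLENGTH (length) of that span.
    match($0, /antisense.*::/) {
      s = substr($0, RSTART, RLENGTH)   # copy only the matched span
      gsub(/-/, ".", s)                 # replace hyphens inside the copy
      # Rebuild the line: text before the span, edited span, text after it.
      $0 = substr($0, 1, RSTART - 1) s substr($0, RSTART + RLENGTH)
    }
    1                                   # print every line, changed or not
  ' "$@"
}

# Usage with one header from the sample data:
printf '%s\n' '>antisense_tRNA-Leu_tRNA-Met::NC_009089.1:30389-30422(+)' | fix_headers
```

Because nothing is split into fields, no delimiter is ever rewritten, so the "::" and ":" separators come through unchanged. Note that .* is greedy, so on a line with more than one "::" the span would run to the last one; the sample headers contain only one.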

Upvotes: 3
