DanielSebas

Reputation: 135

How to substitute a string in a file with two columns?

My file looks like this (two columns separated by a tab):

Others  ___
Archaea ___
Archaea_Euryarchaeota   ___
Archaea_Methanomicrobia_o_RCII  ___
Bacteria1       ___
Bacteria2       ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__    g__
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__       ___
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__     g__
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ g__
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__   g__
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__   g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__        g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__  g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ g__
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__   ___

What I am trying to do is:

When g__ appears in the second column, I need to replace it with the last word found (after p__, c__, o__, f__ or g__) in the first column. For instance, in the line

Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ g__

g__ in second column should be replaced with Microthrixaceae.

Similarly, when ___ is found in the second column, it should be replaced with the last word found (after p__, c__, o__, f__ or g__) in the first column. I would really appreciate your suggestions. Thanks!

The output should look like this:

Others  ___
Archaea ___
Archaea_Euryarchaeota   ___
Archaea_Methanomicrobia_o_RCII  ___
Bacteria1       ___
Bacteria2       ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__ Holophagaceae    
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__       Solibacterales
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__     Solibacterales
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ Sva0725
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__   Chloracidobacteria
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__   SJA-36
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__        Acidimicrobiales
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__  EB1017
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ Microthrixaceae
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__   Actinobacteria

Upvotes: 1

Views: 69

Answers (3)

KamilCuk

Reputation: 140960

Because sed:

sed -E 's/^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__/\1\2\5\t\4/'

Because regexes are greedy, this gets the last word after (p|c|o|f)__ and substitutes it for the second column.

Tested with:

cat <<EOF | tr -s ' ' | tr ' ' '\t' |
Others  ___
Archaea ___
Archaea_Euryarchaeota   ___
Archaea_Methanomicrobia_o_RCII  ___
Bacteria1       ___
Bacteria2       ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__    g__
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__       ___
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__     g__
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ g__
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__   g__
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__   g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__        g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__  g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ g__
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__   ___
EOF
sed -E 's/^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__/\1\2\5\t\4/'

produces:

Others  ___
Archaea ___
Archaea_Euryarchaeota   ___
Archaea_Methanomicrobia_o_RCII  ___
Bacteria1   ___
Bacteria2   ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__    Holophagaceae
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__   Solibacterales
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__ Solibacterales
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ Sva0725
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__   Chloracidobacteria
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__   SJA-36
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__    Acidimicrobiales
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__  EB1017
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ Microthrixaceae
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__   Actinobacteria

Short explanation of ^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__ regex:

  • ^ - match the beginning of the line
  • ([^\t]*) - match and remember everything in front of the (p|c|o|f)__. We match only up to \t, because we are interested in the first field only
  • ((p|c|o|f)__[[]?([^];[:space:]]+))
    • (p|c|o|f)__ - match the initial prefix
    • [[]? - optionally match the [ in front of the word, so that it is left out of the captured word
    • ([^];[:space:]]+) - you didn't define what a word is, so I match up until a ; or ] or whitespace is encountered.
  • ([^\t]*) - match the rest of first field
  • \t - the separator
  • (g|_)__ - match the second field
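
To see the groups at work on single lines, the same substitution can be wrapped in a small function and fed one record at a time. This is only a sketch: last_rank is a made-up helper name, and it assumes GNU sed, since -E and \t inside the pattern are not guaranteed by POSIX sed.

```shell
# Hypothetical helper (not part of the answer): wraps the sed command above.
# Assumes GNU sed, which understands -E and \t inside the pattern.
last_rank() {
    sed -E 's/^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__/\1\2\5\t\4/'
}

# A line with a bracketed name and empty trailing ranks: the second column
# should become the last non-empty name, without the brackets.
printf 'Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__\tg__\n' | last_rank

# A line without any p__/c__/o__/f__ prefix does not match and passes through unchanged.
printf 'Others\t___\n' | last_rank
```

Using printf with \t keeps the test input independent of how tabs survive copy and paste.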

Upvotes: 0

James Brown

Reputation: 37394

One in GNU awk:

$ awk -F"[\t;]" '
BEGIN {
    p="^[pcofg]_+"
}
{
    for(i=NF-1;i>=1;i--)
        if($i~p "[^_$]") {
            b=$i
            sub(p,"",b)
            print gensub(/[^;\t]+$/,b,1,$0)
            next
        }
}1' file

Output:

Others  ___
Archaea ___
Archaea_Euryarchaeota   ___
Archaea_Methanomicrobia_o_RCII  ___
Bacteria1       ___
Bacteria2       ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__    Holophagaceae
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__       Solibacterales
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__     Solibacterales
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ Sva0725
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__   [Chloracidobacteria]
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__   SJA-36
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__        Acidimicrobiales
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__  EB1017
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ Microthrixaceae
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__   Actinobacteria
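
Note that the output above keeps the square brackets around Chloracidobacteria, while the expected output in the question drops them. Below is a rough sketch of a variant that strips the brackets and avoids gensub, so it should not need GNU awk. The function name fill_second_column is made up for illustration, and it assumes the two columns are separated by exactly one tab.

```shell
# Hypothetical helper (not from the answer above); reads files given as
# arguments, or standard input when called without arguments.
fill_second_column() {
    awk '
    BEGIN { FS = OFS = "\t" }
    # Only touch records whose second column is one of the two placeholders.
    $2 == "g__" || $2 == "___" {
        n = split($1, rank, ";")
        # Walk the ranks from right to left and stop at the first named one.
        for (i = n; i >= 1; i--) {
            name = rank[i]
            if (name ~ /^[pcofg]__./) {
                sub(/^[pcofg]__/, "", name)   # drop the rank prefix
                gsub(/\[|\]/, "", name)       # drop square brackets
                $2 = name
                break
            }
        }
    }
    { print }
    ' "$@"
}
```

It would be called as fill_second_column file. Records with no named rank in the first column, such as Others, are printed unchanged.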

Upvotes: 0

webb

Reputation: 4340

awk, perl, or even sed are definitely better choices here than pure bash. Here's a Perl solution:

perl -pe 's/(.*?)([a-z]__\[?)([A-Za-z0-9-]+)(\])?((?:;[a-z]?__)*)(\t)([g_]__)/$1$2$3$4$5\t$3/' yourfilename

For some explanation of why this works, hover over the highlighted regular expression string here: https://regex101.com/r/tLpMCG/1

(Note that the regular expression there is very slightly different than in this answer because here I used perl, but there I was forced to use php, and I had difficulty pasting in the tabs.)

Upvotes: 1
