Reputation: 135
My file content looks like this: (two columns separated by "tab")
Others ___
Archaea ___
Archaea_Euryarchaeota ___
Archaea_Methanomicrobia_o_RCII ___
Bacteria1 ___
Bacteria2 ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__ g__
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__ ___
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__ g__
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ g__
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__ g__
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__ g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__ g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__ g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ g__
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__ ___
What I am trying to do is:
When I find g__ in the second column, I need to replace it with the last found word (after p__ or c__ or o__ or f__ or g__) in the first column. For instance, in the line
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ g__
the g__ in the second column should be replaced with Microthrixaceae.
Similarly, when ___ is found in the second column, replace it with the last found word (after p__ or c__ or o__ or f__ or g__) in the first column. I would really appreciate your suggestions. Thanks!
The output should look like this:
Others ___
Archaea ___
Archaea_Euryarchaeota ___
Archaea_Methanomicrobia_o_RCII ___
Bacteria1 ___
Bacteria2 ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__ Holophagaceae
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__ Solibacterales
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__ Solibacterales
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ Sva0725
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__ Chloracidobacteria
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__ SJA-36
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__ Acidimicrobiales
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__ EB1017
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ Microthrixaceae
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__ Actinobacteria
Upvotes: 1
Views: 69
Reputation: 140960
Because sed:
sed -E 's/^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__/\1\2\5\t\4/'
Because regex quantifiers are greedy, this gets the last non-empty word after (p|c|o|f)__ and substitutes it for the second column.
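To see the greediness at work, here is a small sketch that prints only capture groups 1 and 4 of the same regex. The function name show_groups and the shortened sample line are made up for illustration, and it assumes GNU sed (for \t), like the command above.

```shell
# Hypothetical helper: prints capture groups 1 and 4 of the regex, each in brackets.
show_groups() {
  sed -E 's/^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__/[\1] [\4]/'
}

# Made-up sample line with short rank names.
# The leading [^\t]* first grabs the whole first field, then backtracks only
# as far as needed: f__ is skipped because no word follows it, so o__C wins.
printf 'Bacteria;p__A;c__B;o__C;f__;g__\tg__\n' | show_groups
# prints: [Bacteria;p__A;c__B;] [C]
```

Group 1 ends up holding all the earlier ranks, which is why the last rank with a non-empty name is the one that lands in group 4.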
Tested with:
cat <<EOF | tr -s ' ' | tr ' ' '\t' |
Others ___
Archaea ___
Archaea_Euryarchaeota ___
Archaea_Methanomicrobia_o_RCII ___
Bacteria1 ___
Bacteria2 ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__ g__
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__ ___
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__ g__
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ g__
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__ g__
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__ g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__ g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__ g__
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ g__
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__ ___
EOF
sed -E 's/^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__/\1\2\5\t\4/'
produces:
Others ___
Archaea ___
Archaea_Euryarchaeota ___
Archaea_Methanomicrobia_o_RCII ___
Bacteria1 ___
Bacteria2 ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__ Holophagaceae
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__ Solibacterales
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__ Solibacterales
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ Sva0725
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__ Chloracidobacteria
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__ SJA-36
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__ Acidimicrobiales
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__ EB1017
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ Microthrixaceae
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__ Actinobacteria
Short explanation of the ^([^\t]*)((p|c|o|f)__[[]?([^];[:space:]]+))([^\t]*)\t(g|_)__ regex:
^ - match the beginning of the line
([^\t]*) - match and remember everything in front of the (p|c|o|f)__. We match only up to \t, because we are interested in the first field only
((p|c|o|f)__[[]?([^];[:space:]]+)) - the prefix together with the word that follows it:
  (p|c|o|f)__ - match the initial prefix
  [[]? - optionally match the [ in front of the word, so that it is left out of the captured word
  ([^];[:space:]]+) - you didn't define what a word is, so I match until ; or ] or whitespace is encountered
([^\t]*) - match the rest of the first field
\t - the separator
(g|_)__ - match the second field
Upvotes: 0
Reputation: 37394
One in GNU awk:
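$ awk -F"[\t;]" '
BEGIN {
    p="^[pcofg]_+"                          # rank prefix such as p__ or g__
}
{
    for(i=NF-1;i>=1;i--)                    # scan the first-column fields right to left
        if($i~p "[^_$]") {                  # prefix followed by a name character (not _ or $)
            b=$i
            sub(p,"",b)                     # strip the prefix, keep the name
            print gensub(/[^;\t]+$/,b,1,$0) # replace the second column with the name
            next
        }
}1' file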
Output:
Others ___
Archaea ___
Archaea_Euryarchaeota ___
Archaea_Methanomicrobia_o_RCII ___
Bacteria1 ___
Bacteria2 ___
Bacteria;p__Acidobacteria;c__Holophagae;o__Holophagales;f__Holophagaceae;g__ Holophagaceae
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;__;__ Solibacterales
Bacteria;p__Acidobacteria;c__Solibacteres;o__Solibacterales;f__;g__ Solibacterales
Bacteria;p__Acidobacteria;c__Sva0725;o__Sva0725;f__;g__ Sva0725
Bacteria;p__Acidobacteria;c__[Chloracidobacteria];o__;f__;g__ [Chloracidobacteria]
Bacteria;p__Acidobacteria;c__iii1-8;o__SJA-36;f__;g__ SJA-36
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__;g__ Acidimicrobiales
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__EB1017;g__ EB1017
Bacteria;p__Actinobacteria;c__Acidimicrobiia;o__Acidimicrobiales;f__Microthrixaceae;g__ Microthrixaceae
Bacteria;p__Actinobacteria;c__Actinobacteria;__;__;__ Actinobacteria
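Note that this output keeps the square brackets around [Chloracidobacteria], while the expected output in the question drops them. One possible variant, as a sketch rather than a fix of the command above: it avoids the GNU-only gensub, splits the first column on ; itself, and strips the brackets. The function name fill_last_rank is hypothetical, and the sketch assumes the second column is exactly g__ or ___ whenever it should be replaced.

```shell
# Hypothetical helper: reads tab-separated lines on stdin, writes them to stdout.
fill_last_rank() {
  awk -F'\t' -v OFS='\t' '
    $2 ~ /^[g_]__$/ {                        # only touch placeholder values
      n = split($1, part, ";")
      for (i = n; i >= 1; i--) {             # walk the ranks right to left
        name = part[i]
        # keep the first rank that has a prefix and a non-empty name
        if (sub(/^[pcofg]__/, "", name) && name != "") {
          gsub(/\[|\]/, "", name)            # drop the square brackets
          $2 = name
          break
        }
      }
    }
    { print }
  '
}

fill_last_rank < file
```

Lines without any prefixed rank, such as Others or Bacteria1, pass through unchanged because the loop never finds a match.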
Upvotes: 0
Reputation: 4340
awk or perl or even sed are definitely better choices here than pure bash. Here's a perl solution:
perl -pe 's/(.*?)([a-z]__\[?)([A-Za-z0-9-]+)(\])?(;[a-z]?__)*(\t)([g_]__)/$1$2$3$4$5\t$3/' yourfilename
For an explanation of why this works, mouse over the highlighted parts of the regular expression here: https://regex101.com/r/tLpMCG/1
(Note that the regular expression there is very slightly different from the one in this answer, because here I used perl but there I was forced to use php, and I had difficulty pasting in the tabs.)
Upvotes: 1