Reputation: 129
I have a file with tens of thousands of tab-separated lines like this:
cluster11586 TRINITY_DN135758_c4_g1_i1 5'-adenylylsulfate reductase-like 4 9.10921
cluster41208 TRINITY_DN130890_c2_g1_i1 Anthranilate phosphoribosyltransferase, chloroplastic 18.5398
cluster26862 TRINITY_DN132510_c1_g1_i2 ATP synthase subunit alpha, mitochondrial 4.82626
cluster13001 TRINITY_DN130890_c4_g1_i3 Phosphopantetheine adenylyltransferase 2.58108
I would like to use grep/awk/sed to produce a file with the text after the first two columns and before the final decimal number, with the tabs removed and the white spaces replaced with underscores:
5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase
I thought of extracting everything before the final decimal number, which I can match with [0-9]+\.[0-9]+$, then piping the result to something like awk '{$1=$2=""; print $0}' to remove the first two columns (and hopefully the following tab too), and then sending that to sed -e 's/ /_/g'
But how can I extract the text before the final decimal number in each line without capturing either the number itself or the whitespace preceding it? Also, awk seems to leave the tab behind after removing the first two columns. Can I do all of this without writing intermediate files?
Upvotes: 3
Views: 180
Reputation: 104032
You can do:
$ cut -d $'\t' -f 3- file |
sed -E 's/^(.*)[[:space:]][[:digit:]][[:digit:]]*\.[[:digit:]][[:digit:]]*/\1/; s/[[:space:]]*$//; s/[[:space:]]/_/g'
5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase
Since the final decimal number is tab delimited, you can rely more on cut to find the fields and only use sed to change ' ' to _:
$ cut -d $'\t' -f 3- file | cut -d $'\t' -f 1 | sed -E 's/[[:space:]]/_/g'
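If every line really has exactly four tab-separated columns (an assumption based on the sample, not checked against the whole file), the two cut calls can probably be collapsed into one, and tr can stand in for sed. A minimal sketch; describe_cut is a made-up helper name that reads from standard input:

```shell
# Sketch only: assumes four tab-separated columns with the description in column 3.
# describe_cut is a hypothetical helper name, not part of the answer above.
describe_cut() {
  # cut splits on tabs by default; tr turns every space into an underscore
  cut -f 3 | tr ' ' '_'
}

# Example with one line built by printf so the tabs are explicit
printf 'cluster13001\tTRINITY_DN130890_c4_g1_i3\tPhosphopantetheine adenylyltransferase\t2.58108\n' |
  describe_cut   # prints Phosphopantetheine_adenylyltransferase
```

On the real data this would be used as describe_cut < file. Lines with a different column count would need one of the regex-based approaches instead.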
Upvotes: 0
Reputation: 204218
Understanding this will give you a good idea of how awk uses fields and field separators to split and recombine records:
$ awk '{$1=$2=$NF=""; $0=$0; OFS="_"; $1=$1; OFS=FS} 1' file
5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase
In steps:
$ awk '{$1=$2=$NF=""; print "<" $0 ":" $1 ">"}' file
<  5'-adenylylsulfate reductase-like 4 :>
<  Anthranilate phosphoribosyltransferase, chloroplastic :>
<  ATP synthase subunit alpha, mitochondrial :>
<  Phosphopantetheine adenylyltransferase :>
$ awk '{$1=$2=$NF=""; $0=$0; print "<" $0 ":" $1 ">"}' file
<  5'-adenylylsulfate reductase-like 4 :5'-adenylylsulfate>
<  Anthranilate phosphoribosyltransferase, chloroplastic :Anthranilate>
<  ATP synthase subunit alpha, mitochondrial :ATP>
<  Phosphopantetheine adenylyltransferase :Phosphopantetheine>
$ awk '{$1=$2=$NF=""; $0=$0; $1=$1; print "<" $0 ":" $1 ">"}' file
<5'-adenylylsulfate reductase-like 4:5'-adenylylsulfate>
<Anthranilate phosphoribosyltransferase, chloroplastic:Anthranilate>
<ATP synthase subunit alpha, mitochondrial:ATP>
<Phosphopantetheine adenylyltransferase:Phosphopantetheine>
$ awk '{$1=$2=$NF=""; $0=$0; OFS="_"; $1=$1; OFS=FS; print "<" $0 ":" $1 ">"}' file
<5'-adenylylsulfate_reductase-like_4:5'-adenylylsulfate>
<Anthranilate_phosphoribosyltransferase,_chloroplastic:Anthranilate>
<ATP_synthase_subunit_alpha,_mitochondrial:ATP>
<Phosphopantetheine_adenylyltransferase:Phosphopantetheine>
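The steps above all rely on one behavior: assigning to any field makes awk rebuild $0 from the current fields, joined by OFS. Here is a small sketch of that in isolation, using the default field separator (runs of spaces and tabs). rejoin_with is a hypothetical helper name used only for illustration:

```shell
# Sketch only: rejoin_with is a hypothetical helper, not part of the answer above.
# $1 is the string to use as the output field separator (OFS).
rejoin_with() {
  # $1 = $1 changes nothing in the field itself, but forces awk to
  # reassemble $0 from its fields using OFS
  awk -v OFS="$1" '{ $1 = $1; print }'
}

# Mixed spaces and a tab on input; all of them act as field separators
printf 'ATP  synthase\tsubunit\n' | rejoin_with '_'   # prints ATP_synthase_subunit
```

Leading and trailing whitespace disappears in the same step, because with the default FS it never becomes part of any field.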
Upvotes: 3
Reputation: 20022
Remove the first two tab-terminated columns (each a run of non-tab characters followed by a tab), capture the next part, which must not end in a digit, and match the decimal number at the end of the line:
sed -r 's/([^\t]*\t){2}(.*[^0-9])[0-9]*[.][0-9]*$/\2/' file
Next, add two simple replacements:
sed -r 's/([^\t]*\t){2}(.*[^0-9])[0-9]*[.][0-9]*$/\2/;s/ /_/g;s/\t//g' file
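The \t escape inside a sed expression is understood by GNU sed but may not be by other implementations. If that matters, one possible workaround is to put a literal tab into a shell variable and interpolate it. A sketch under the assumption of four tab-separated columns; strip_to_description and the tab variable are names invented here:

```shell
# Sketch only: strip_to_description is a hypothetical helper, not from the answer above.
# Assumes the decimal number is the last column and is preceded by a tab.
strip_to_description() {
  tab=$(printf '\t')   # a literal tab character, so sed never sees \t
  # 1) drop the first two tab-terminated columns
  # 2) drop the tab plus the trailing decimal number
  # 3) replace the remaining spaces with underscores
  sed -E "s/^([^${tab}]*${tab}){2}//; s/${tab}[0-9]+\.[0-9]+\$//; s/ /_/g"
}

printf 'cluster26862\tTRINITY_DN132510_c1_g1_i2\tATP synthase subunit alpha, mitochondrial\t4.82626\n' |
  strip_to_description   # prints ATP_synthase_subunit_alpha,_mitochondrial
```

On the real data this would be run as strip_to_description < file.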
Upvotes: 0