Reputation: 129

Grep all characters in each line before match

I have a file with tens of thousands of tab-separated lines like this:

cluster11586    TRINITY_DN135758_c4_g1_i1   5'-adenylylsulfate reductase-like 4 9.10921
cluster41208    TRINITY_DN130890_c2_g1_i1   Anthranilate phosphoribosyltransferase, chloroplastic   18.5398
cluster26862    TRINITY_DN132510_c1_g1_i2   ATP synthase subunit alpha, mitochondrial   4.82626
cluster13001    TRINITY_DN130890_c4_g1_i3   Phosphopantetheine adenylyltransferase  2.58108

I would like to use grep/awk/sed to produce a file with the text after the first two columns and before the final decimal number, with the tabs removed and the white spaces replaced with underscores:

5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase

I thought of extracting everything before the final decimal number, which I can match with [0-9]+\.[0-9]+$, then piping the result to something like awk '{$1=$2=""; print $0}' to remove the first two columns (and hopefully the following tab too), and finally sending that to sed -e 's/ /_/g'. But how can I extract the text before the final decimal number in each line without also getting the decimal number or the whitespace preceding it? Also, awk seems to leave the tab behind after removing the first two columns. Can I do all this without writing intermediate files?

Upvotes: 3

Answers (3)

dawg

Reputation: 104032

You can do:

$ cut -d $'\t' -f 3- file |
  sed -E 's/^(.*)[[:space:]][[:digit:]][[:digit:]]*\.[[:digit:]][[:digit:]]*/\1/; s/[[:space:]]*$//; s/[[:space:]]/_/g'
5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase

Since the final decimal number is tab-delimited, you can rely more on cut to isolate the fields and use sed only to change spaces to underscores:

$ cut -d $'\t' -f 3- file | cut -d $'\t' -f 1 | sed -E 's/[[:space:]]/_/g'
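
Because cut splits on tabs by default, the two cut calls can probably be collapsed into one. A minimal sketch, assuming the description is always exactly the third tab-separated field and contains spaces but no tabs; the `sample` variable and its two lines are made up for illustration and stand in for the real file:

```shell
# Two tab-separated sample records (illustrative stand-in for the real file).
sample=$(printf '%s\t%s\t%s\t%s\n' \
  cluster11586 TRINITY_DN135758_c4_g1_i1 "5'-adenylylsulfate reductase-like 4" 9.10921 \
  cluster13001 TRINITY_DN130890_c4_g1_i3 "Phosphopantetheine adenylyltransferase" 2.58108)

# cut uses tab as its default delimiter, so -f 3 selects the description;
# tr then turns every space into an underscore.
result=$(printf '%s\n' "$sample" | cut -f 3 | tr ' ' '_')
printf '%s\n' "$result"
```

With a real file, the same idea would read `cut -f 3 file | tr ' ' '_'`.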

Upvotes: 0

Ed Morton

Reputation: 204218

Understanding this will give you a good idea of how awk uses fields and field separators to split and recombine records:

$ awk '{$1=$2=$NF=""; $0=$0; OFS="_"; $1=$1; OFS=FS} 1' file
5'-adenylylsulfate_reductase-like_4
Anthranilate_phosphoribosyltransferase,_chloroplastic
ATP_synthase_subunit_alpha,_mitochondrial
Phosphopantetheine_adenylyltransferase

In steps:

$ awk '{$1=$2=$NF=""; print "<" $0 ":" $1 ">"}' file
<  5'-adenylylsulfate reductase-like 4 :>
<  Anthranilate phosphoribosyltransferase, chloroplastic :>
<  ATP synthase subunit alpha, mitochondrial :>
<  Phosphopantetheine adenylyltransferase :>

$ awk '{$1=$2=$NF=""; $0=$0; print "<" $0 ":" $1 ">"}' file
<  5'-adenylylsulfate reductase-like 4 :5'-adenylylsulfate>
<  Anthranilate phosphoribosyltransferase, chloroplastic :Anthranilate>
<  ATP synthase subunit alpha, mitochondrial :ATP>
<  Phosphopantetheine adenylyltransferase :Phosphopantetheine>

$ awk '{$1=$2=$NF=""; $0=$0; $1=$1; print "<" $0 ":" $1 ">"}' file
<5'-adenylylsulfate reductase-like 4:5'-adenylylsulfate>
<Anthranilate phosphoribosyltransferase, chloroplastic:Anthranilate>
<ATP synthase subunit alpha, mitochondrial:ATP>
<Phosphopantetheine adenylyltransferase:Phosphopantetheine>

$ awk '{$1=$2=$NF=""; $0=$0; OFS="_"; $1=$1; OFS=FS; print "<" $0 ":" $1 ">"}' file
<5'-adenylylsulfate_reductase-like_4:5'-adenylylsulfate>
<Anthranilate_phosphoribosyltransferase,_chloroplastic:Anthranilate>
<ATP_synthase_subunit_alpha,_mitochondrial:ATP>
<Phosphopantetheine_adenylyltransferase:Phosphopantetheine>
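
The two mechanisms the steps above rely on can also be seen in isolation. This is a small sketch on made-up input (`a b c`), not part of the original data: assigning to a field rebuilds $0 without re-splitting it, `$0=$0` forces a re-split, and `$1=$1` rebuilds the record using the current OFS.

```shell
# Assigning to a field rebuilds $0 but does not re-split it:
# NF is still 3 even though the first field is now empty.
before=$(printf 'a b c\n' | awk '{$1=""; print NF}')

# $0=$0 makes awk split the record again, so the empty field disappears.
after=$(printf 'a b c\n' | awk '{$1=""; $0=$0; print NF}')

# $1=$1 rebuilds $0, joining the fields with the current OFS.
joined=$(printf 'a  b   c\n' | awk '{OFS="_"; $1=$1} 1')

printf '%s %s %s\n' "$before" "$after" "$joined"   # prints: 3 2 a_b_c
```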

Upvotes: 3

Walter A

Reputation: 20022

Match the first two columns (each a run of non-tab characters followed by a tab),
capture the following part, which must end in a non-digit,
and match the decimal number at the end of the line.

sed -r 's/([^\t]*\t){2}(.*[^0-9])[0-9]*[.][0-9]*$/\2/' file

Next, add two simple replacements (spaces to underscores, then remove the remaining tabs):

sed -r 's/([^\t]*\t){2}(.*[^0-9])[0-9]*[.][0-9]*$/\2/;s/ /_/g;s/\t//g' file
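
The commands above rely on sed interpreting \t as a tab, which GNU sed does but some other implementations may not. A hedged sketch of a more portable spelling, assuming a sed that accepts -E for extended regular expressions; the `tab` and `sample` variables are introduced here only for illustration:

```shell
# Hold a literal tab in a variable so the expression does not depend on
# sed understanding \t (assumption: non-GNU seds may treat \t literally).
tab=$(printf '\t')

# Two tab-separated sample records (illustrative stand-in for the real file).
sample=$(printf '%s\t%s\t%s\t%s\n' \
  cluster11586 TRINITY_DN135758_c4_g1_i1 "5'-adenylylsulfate reductase-like 4" 9.10921 \
  cluster26862 TRINITY_DN132510_c1_g1_i2 "ATP synthase subunit alpha, mitochondrial" 4.82626)

# Same three substitutions as above, with $tab expanded by the shell.
cleaned=$(printf '%s\n' "$sample" |
  sed -E "s/([^$tab]*$tab){2}(.*[^0-9])[0-9]*[.][0-9]*\$/\2/; s/ /_/g; s/$tab//g")
printf '%s\n' "$cleaned"
```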

Upvotes: 0
