Cindy Almighty

Reputation: 933

Bash sed with greedy regex

I have a GTF file (a type of TSV) with the following structure:

ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|    13511132.24 244.489 2.7098
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|   68  26.127  0   0
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|   712 493.243 0   0

I would like to remove every name except the first from the first column, where the names are separated by "|". For example, the first line should become:

ENST00000488147.1    13511132.24 244.489 2.7098

My idea is to replace everything from the first "|" to the first "\t" with "\t", but sed is failing me. This command makes no changes:

sed 's/|*\t/\t/' test.tsv 

What am I doing wrong, and is there a better way to do this altogether?

Upvotes: 0

Views: 118

Answers (1)

Charles Duffy

Reputation: 295969

Consider:

sed -re $'s@[|][^\t]*\t@\t@g'
  • $'...' is a ksh/bash quoting extension: the shell itself expands \t to a literal tab, so the command does not rely on your sed treating \t as a tab (behavior the standard does not require).
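  • sed -r puts sed in POSIX ERE mode rather than the default BRE mode. (-r is the GNU spelling; -E does the same thing and is accepted by both GNU and BSD sed.)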
  • Using [|] matches only the literal | character, regardless of which regex syntax variant is in use.
  • Using [^\t]* matches zero or more characters that are not tabs, so the match stops at the first tab. By contrast, .* is greedy and also matches tabs, so it would run to the last tab on the line and delete the columns in between.
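
To make the difference concrete, here is a minimal sketch that runs three patterns against one made-up row (the ID1|ID2|ID3 data is hypothetical, not from the question). It keeps the tab in a variable so that it also runs under plain sh; in bash, $'\t' as used above would do the same job. With a literal tab substituted in, the question's pattern strips only the "|" directly before the tab, because in BRE "|*" means "zero or more | characters" rather than "| followed by anything".

```shell
# Hypothetical sample row: "|"-separated names, then tab-separated values.
tab=$(printf '\t')   # a literal tab character
line="ID1|ID2|ID3|${tab}68${tab}26.127${tab}0"

# Pattern from the question: "|*" is "zero or more | characters",
# so only the "|" right before the first tab is removed.
original=$(printf '%s\n' "$line" | sed "s/|*${tab}/${tab}/")

# Greedy ".*" runs to the LAST tab and swallows the middle columns.
greedy=$(printf '%s\n' "$line" | sed -E "s@[|].*${tab}@${tab}@")

# "[^<tab>]*" cannot cross a tab, so the match ends at the FIRST tab.
bounded=$(printf '%s\n' "$line" | sed -E "s@[|][^${tab}]*${tab}@${tab}@")

printf 'original: %s\n' "$original"   # ID1|ID2|ID3<TAB>68<TAB>26.127<TAB>0
printf 'greedy:   %s\n' "$greedy"     # ID1<TAB>0
printf 'bounded:  %s\n' "$bounded"    # ID1<TAB>68<TAB>26.127<TAB>0
```

Only the bounded pattern keeps every value column while dropping the extra names.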

In context, as testable code:

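# Print each argument followed by a tab, then a newline.
# Note that every generated line therefore ends with a trailing tab.
write_line() {
  printf '%s\t' "$@" && printf '\n'
}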
generate_input() {
  write_line 'ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|' 13511132.24 244.489 2.7098
  write_line 'ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|'    68  26.127  0   0
  write_line 'ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|'    712 493.243 0   0
}
generate_input | sed -re $'s@[|][^\t]*\t@\t@g'

...produces as output:

ENST00000488147.1   13511132.24 244.489 2.7098  
ENST00000619216.1   68  26.127  0   0   
ENST00000473358.1   712 493.243 0   0   
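
As for the "better way" part of the question, one possible alternative (a sketch, not part of the tested code above) is to let awk split the line on tabs and edit only the first field, which avoids the greediness problem because the pattern never sees a tab. The function name strip_extra_ids and the sample row are hypothetical.

```shell
# strip_extra_ids is a hypothetical helper name used only for this sketch.
strip_extra_ids() {
  # FS and OFS are both a tab, so $1 is exactly the "|"-separated name column.
  # sub() deletes everything from the first "|" to the end of that field.
  awk 'BEGIN { FS = OFS = "\t" } { sub(/[|].*/, "", $1); print }' "$@"
}

# Sample row in the same shape as the question's data.
result=$(printf 'ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|\t68\t26.127\t0\t0\n' | strip_extra_ids)
printf '%s\n' "$result"
```

With a real file, the call would be strip_extra_ids test.tsv. Lines whose first column contains no "|" pass through unchanged.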

Upvotes: 2
