Reputation: 933
I have a GTF file (a type of TSV) with the following structure:
ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| 13511132.24 244.489 2.7098
ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA| 68 26.127 0 0
ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA| 712 493.243 0 0
I would like to remove every name from the first column except the first one; the names are separated by "|". For example, the first line should be:
ENST00000488147.1 13511132.24 244.489 2.7098
My idea is to replace everything from first "|" to the first "\t" with "\t", but sed is failing me. This command makes no changes:
sed 's/|*\t/\t/' test.tsv
What am I doing wrong, and is there a better way to do this completely?
Upvotes: 0
Views: 118
Reputation: 295969
Consider:
sed -re $'s@[|][^\t]*\t@\t@g'
- $'...' is a ksh/bash syntax extension that makes $'\t' be expanded to a literal tab by the shell, instead of assuming that you have a sed that (without reference to the standard) treats \t sequences as if they were tabs.
- sed -r puts sed in POSIX ERE mode, vs BRE mode.
- [|] matches only the literal | character, regardless of which regex syntax variant is in use.
- [^\t]* matches zero-or-more things that are not tabs, whereas .* would match things that are tabs, which wouldn't result in the desired output.
In context, as testable code:
write_line() {
  printf '%s\t' "$@" && printf '\n';
}
generate_input() {
  write_line 'ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene|' 13511132.24 244.489 2.7098
  write_line 'ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|' 68 26.127 0 0
  write_line 'ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA|' 712 493.243 0 0
}
generate_input | sed -re $'s@[|][^\t]*\t@\t@g'
...produces as output:
ENST00000488147.1 13511132.24 244.489 2.7098
ENST00000619216.1 68 26.127 0 0
ENST00000473358.1 712 493.243 0 0
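As for whether there is a better way: one possible alternative, if the file really is tab-separated, is to let awk split the fields and trim only the first one. This is a rough sketch rather than a drop-in replacement; the function name strip_ids is made up for illustration, and the sample record is shortened from the question's data.

```shell
# strip_ids is a hypothetical helper name, not part of any tool.
# It reads tab-separated records from the named files (or stdin when
# none are given) and cuts column 1 at its first "|".
strip_ids() {
    awk 'BEGIN { FS = OFS = "\t" }        # split and rejoin on tabs
         { sub(/[|].*/, "", $1); print }  # drop first "|" and the rest of field 1
        ' "$@"
}

# Example: one record shaped like the question's input.
printf 'ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|\t68\t26.127\t0\t0\n' | strip_ids
```

Because the substitution is confined to $1, it cannot touch later columns even if one of them happens to contain a "|", which the sed version would only guarantee for the first match on the line.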
Upvotes: 2