Reputation: 1252
I can't quite get the regex I need to solve this, so asking the SO wizards for help!
Given:
LOCUS NODE_96_length_17326_cov_8.76428_ID_1>17327 bp DNA linear
LOCUS NODE_97_length_17208_cov_6.56803_ID_1>17208 bp DNA linear
LOCUS NODE_98_length_17111_cov_6.60638_ID_1>17111 bp DNA linear
LOCUS NODE_99_length_17092_cov_6.7682_ID_19717092 bp DNA linear
LOCUS NODE_9_length_59921_cov_8.04963_ID_1759921 bp DNA linear
I need to replace the string between NODE and the sequence of numbers at the end of that same string. The character preceding the numbers (e.g. in line 1, 17327) can appear as a > or a _. So basically I need to replace everything from NODE up to and including the last > or _, or match up until a multi-digit integer of unknown length.
Best I'd managed so far was:
sed 's/\(NODE.*\)\(>|_\)/newstring/'
But I know this doesn't work.
Just to make it painfully clear, this would be the desired output.
LOCUS newstring 17327 bp DNA linear
LOCUS newstring 17208 bp DNA linear
LOCUS newstring 17111 bp DNA linear
LOCUS newstring 19717092 bp DNA linear
LOCUS newstring 1759921 bp DNA linear
Upvotes: 1
Views: 43
Reputation: 887
I would do it like this:
\b(NODE.*\D)\d+\s
A word boundary, the word NODE, anything up until something that is not a digit, then one or more digits, then a whitespace character.
Sed might need the word boundary as \< (start of word).
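The pattern above is written in PCRE-style syntax, and sed does not understand \d or \D. A rough translation for sed, as a sketch only: it assumes a sed that accepts -E (GNU and BSD sed both do), swaps in POSIX bracket expressions, drops the word boundary for portability, and captures the trailing digits so they can be put back with \1. The variable names input and result are just for illustration.

```shell
# Two sample lines from the question, one per delimiter (> and _).
input='LOCUS NODE_96_length_17326_cov_8.76428_ID_1>17327 bp DNA linear
LOCUS NODE_9_length_59921_cov_8.04963_ID_1759921 bp DNA linear'

# -E switches sed to extended regular expressions.
# [^0-9] stands in for \D, [0-9]+ for \d+, [[:space:]] for \s.
# The digits plus the whitespace after them are captured and restored via \1.
result=$(printf '%s\n' "$input" |
  sed -E 's/NODE.*[^0-9]([0-9]+[[:space:]])/newstring \1/')

printf '%s\n' "$result"
```

Note that .* is greedy, so this relies on no later field ending in digits followed by whitespace; that holds for the sample lines but may not for other input.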
Upvotes: 1
Reputation: 784898
You don't need to use any group since you are not using any back-references. You can use:
sed 's/NODE[^[:blank:]]*[_>]/newstring /' file
LOCUS newstring 17327 bp DNA linear
LOCUS newstring 17208 bp DNA linear
LOCUS newstring 17111 bp DNA linear
LOCUS newstring 19717092 bp DNA linear
LOCUS newstring 1759921 bp DNA linear
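Why [^[:blank:]]* rather than .* here: the bracket expression cannot cross a space or tab, so the match stays inside the NODE token and backtracks to the last _ or > within it. A small sketch of the difference, using a made-up input line (the trailing extra_field column is hypothetical and not part of the question's data):

```shell
# Hypothetical line with an underscore in a later column.
line='LOCUS NODE_1_length_100_cov_2.5_ID_7>100 bp DNA linear extra_field'

# Token-bound match: stops at the first blank after NODE...
strict=$(printf '%s\n' "$line" | sed 's/NODE[^[:blank:]]*[_>]/newstring /')

# Unbounded match: .* runs to the last _ or > anywhere on the line.
greedy=$(printf '%s\n' "$line" | sed 's/NODE.*[_>]/newstring /')

printf '%s\n' "$strict"   # LOCUS newstring 100 bp DNA linear extra_field
printf '%s\n' "$greedy"   # LOCUS newstring field
```

On the question's own data both forms give the same result, since no later column contains _ or >.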
Upvotes: 3