Joe Healey

Reputation: 1252

Sed regex between known word and unknown integer

I can't quite get the regex I need to solve this, so asking the SO wizards for help!

Given:

LOCUS       NODE_96_length_17326_cov_8.76428_ID_1>17327 bp   DNA linear
LOCUS       NODE_97_length_17208_cov_6.56803_ID_1>17208 bp   DNA linear
LOCUS       NODE_98_length_17111_cov_6.60638_ID_1>17111 bp   DNA linear
LOCUS       NODE_99_length_17092_cov_6.7682_ID_19717092 bp   DNA linear
LOCUS       NODE_9_length_59921_cov_8.04963_ID_1759921 bp   DNA linear

I need to replace the string between NODE and the sequence of numbers at the end of that same string. The character preceding the numbers (e.g. 17327 in line 1) can be either a > or a _. So basically I need to replace everything from NODE up to and including the last > or _, in other words match up to a multi-digit integer of unknown length.

Best I'd managed so far was:

sed 's/\(NODE.*\)\(>|_\)/newstring/'

But I know this doesn't work.

Just to make it painfully clear, this would be the desired output.

LOCUS       newstring 17327 bp   DNA linear
LOCUS       newstring 17208 bp   DNA linear
LOCUS       newstring 17111 bp   DNA linear
LOCUS       newstring 19717092 bp   DNA linear
LOCUS       newstring 1759921 bp   DNA linear

Upvotes: 1

Views: 43

Answers (2)

linden2015

Reputation: 887

I would do it like this:

\b(NODE.*\D)\d+\s

That is: a word boundary, the word NODE, anything up to and including a character that is not a digit, then one or more digits, then a whitespace character.

Sed might need the word boundary as \< (start of word).
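
As a rough sketch of how this pattern might be carried over to sed: standard sed regexes do not include the \d and \D shorthands, so the version below swaps in bracket expressions and captures the digits so they can be put back in the replacement. The function name rename_node and the sample input are illustrative assumptions, and the word boundary is left out to keep the command portable.

```shell
# Hypothetical helper: replace everything from NODE up to the last non-digit
# before the trailing number, keeping the number itself.
rename_node() {
  # -E enables extended regular expressions (GNU and BSD sed).
  # [^0-9] stands in for \D, [0-9]+ for \d+, [[:space:]] for \s.
  # \1 re-inserts the captured digits and the whitespace after them.
  sed -E 's/NODE.*[^0-9]([0-9]+[[:space:]])/newstring \1/'
}

# Two sample lines copied from the question.
printf '%s\n' \
  'LOCUS       NODE_96_length_17326_cov_8.76428_ID_1>17327 bp   DNA linear' \
  'LOCUS       NODE_9_length_59921_cov_8.04963_ID_1759921 bp   DNA linear' |
  rename_node
```

Because .* is greedy, the [^0-9] ends up on the last non-digit character before the final run of digits, which is the > or _ the question asks about.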

Upvotes: 1

anubhava

Reputation: 784898

You don't need to use any group since you are not using any back-references. You can use:

sed 's/NODE[^[:blank:]]*[_>]/newstring /' file

LOCUS       newstring 17327 bp   DNA linear
LOCUS       newstring 17208 bp   DNA linear
LOCUS       newstring 17111 bp   DNA linear
LOCUS       newstring 19717092 bp   DNA linear
LOCUS       newstring 1759921 bp   DNA linear
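
A slightly stricter variant, offered as a sketch rather than a replacement for the command above: it additionally requires digits directly after the final _ or > and puts them back with a back-reference. The -E flag, the capture group, and the variable names input and result are additions for illustration.

```shell
# Two sample lines copied from the question.
input='LOCUS       NODE_98_length_17111_cov_6.60638_ID_1>17111 bp   DNA linear
LOCUS       NODE_99_length_17092_cov_6.7682_ID_19717092 bp   DNA linear'

# [^[:blank:]]* is greedy, so [_>] lands on the last _ or > in the token
# that is directly followed by digits; \1 puts those digits back.
result=$(printf '%s\n' "$input" |
  sed -E 's/NODE[^[:blank:]]*[_>]([0-9]+)/newstring \1/')

printf '%s\n' "$result"
```

With this form, a NODE token that does not end in _ or > followed by digits is left untouched instead of being partially replaced.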

Upvotes: 3
