eddi

Reputation: 49448

R vs sed regex greediness

I don't quite understand why this doesn't result in "test" and would appreciate an explanation:

a = "blah test"
sub('^.*(test|$)', '\\1', a)
# [1] ""

Compare it to the sed expression:

echo 'blah test' | sed -r 's/^.*(test|$)/\1/'
# test
echo 'blah blah' | sed -r 's/^.*(test|$)/\1/'
#

Fwiw, the following achieves what I want in R (and is equivalent to the above sed results):

sub('^.*(test)|^.*', '\\1', a)

Upvotes: 4

Views: 963

Answers (2)

GSee

Reputation: 49810

You need to make the .* non-greedy by writing it as .*? so that it matches as few characters as possible:

> sub('^.*?(test|$)', '\\1', "blah test")
[1] "test"
> sub('^.*?(test|$)', '\\1', "blah blah")
[1] ""

Upvotes: 5

Akash

Reputation: 5012

The regex engine starts with the greedy .*, which matches every character right up to the end of the string. It then tries to match (test|$), i.e. either the string literal 'test' or the end of the string. Since .* has already consumed all the characters, 'test' cannot match at that position, but $ matches the end of the string straight away, so the engine accepts that match without giving any characters back.

As a result, the captured group is the empty string, which is why sub() returns "".
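
To see this for yourself, here is a minimal sketch in base R that pulls out what the group captured for the greedy pattern from the question and for the non-greedy variant. The variable names (greedy, lazy, consumed) are only for illustration, and it assumes R's default regex engine, as in the question.

```r
# The string and patterns are the ones from the question and the other answer.
a <- "blah test"

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings.
greedy <- regmatches(a, regexec("^.*(test|$)", a))[[1]]
lazy   <- regmatches(a, regexec("^.*?(test|$)", a))[[1]]

# Element 1 is the full match, element 2 is capture group 1.
# The greedy group should be empty, consistent with sub() returning "".
print(greedy)
print(lazy)

# How many characters the .* (or .*?) part took: full match minus group 1.
consumed <- c(greedy = nchar(greedy[1]) - nchar(greedy[2]),
              lazy   = nchar(lazy[1])   - nchar(lazy[2]))
print(consumed)
```

In both cases the full match covers the whole string; the only difference is how it is split between .* and the group.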

I think sed uses a POSIX NFA, which tries to find the longest match in an alternation. That differs from R, which seems to use a traditional NFA.

Upvotes: 2
