R vs sed regex greediness

Question

I don't quite understand why this doesn't result in "test" and would appreciate an explanation:

a = "blah test"
sub('^.*(test|$)', '\\1', a)
# [1] ""

Compare it to the sed expression:

echo 'blah test' | sed -r 's/^.*(test|$)/\1/'
# test
echo 'blah blah' | sed -r 's/^.*(test|$)/\1/'
#

Fwiw, the following achieves what I want in R (and is equivalent to the above sed results):

sub('^.*(test)|^.*', '\\1', a)

Akash · Accepted Answer

The regex engine starts with the greedy .*, which matches every character right up to the end of the string. It then tries to match (test|$), i.e. either the literal string 'test' or the end of the string. Because .* has already consumed everything, 'test' cannot match at that position, but $ can: it matches the end of the string, so the overall match succeeds without the engine having to give back any characters.

As a result, the captured group is the empty string at the end of the input, and that empty string is what the back-reference inserts.
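One way to check this explanation is to look at what the whole pattern and the group actually captured. The sketch below is only an illustration: it passes perl = TRUE so that a backtracking engine (PCRE) is used, which is an assumption on my part, since the question relied on R's default engine. The lazy variant .*? is likewise just an alternative to try, not something taken from the question.

```r
a <- "blah test"

# Whole match first, then the contents of capture group 1.
# perl = TRUE selects PCRE (assumption: chosen here for well-defined
# backtracking behaviour; the question used the default engine).
m <- regmatches(a, regexec("^.*(test|$)", a, perl = TRUE))[[1]]
print(m)
# [1] "blah test" ""

# Replacing the whole match with the (empty) group leaves nothing behind.
greedy <- sub("^.*(test|$)", "\\1", a, perl = TRUE)
print(greedy)
# [1] ""

# A lazy .*? consumes as little as possible, so 'test' gets a chance to
# match before the end of the string is reached.
lazy <- sub("^.*?(test|$)", "\\1", c("blah test", "blah blah"), perl = TRUE)
print(lazy)
# [1] "test" ""
```

The first result shows .* taking the whole string and the group capturing nothing, which matches the "" seen in the question.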

I think sed uses a POSIX NFA, which tries to find the longest match in an alternation. That differs from R, which seems to use a traditional NFA.
