Reputation: 4094
I'm pretty sure I'm missing something obvious here, but I cannot make R to use non-greedy regular expressions:
> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")
[,1]
[1,] "aaaab"
Base functions behave the same way:
> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE
I would expect the match to be just ab
as per 'greedy' comment in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:
By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)
Could someone please explain me what's going on?
Update. What's crazy is that in some other cases non-greedy patterns behave as expected:
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
[,1]
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
[,1]
[1,] "<a href=\"abc\">"
Upvotes: 28
Views: 18130
Reputation: 627488
The problem is matching the shortest window between two strings. @flodel correctly mentions that a regex engine is parsing the string from left to right, and thus all the matches are leftmost. Greediness and laziness only apply to the boundaries on the right: greedy quantifiers get the substrings up to the rightmost boundaries, and the lazy ones will match up to the first occurrence of the subpatterns to follow.
See the examples:
> library(stringr)
> str_extract('xxx aaaab yyy', "a[^ab]*b")
[1] "ab"
> str_extract('xxx aaa xxx aaa zzz', "xxx.*?zzz")
[1] "xxx aaa xxx aaa zzz"
> str_extract('xxx aaa xxx aaa zzz', "xxx(?:(?!xxx|zzz).)*zzz")
[1] "xxx aaa zzz"
The first and the third scenarios return the shortest window, the second one is an illustration of the current problem but with a multicharacter input.
Scenario 1. Boundaries are single characters
In case a
and b
are single characters, the shortest window is found by using a negated character class. a[^ab]*b
will easily grab the substring from a
till the next b
with no a
s and b
s in between.
Scenario 2. Boundaries are not single characters
You may use a tempered greedy token in these cases that can be further unrolled. The xxx(?:(?!xxx|zzz).)*zzz
pattern matches xxx
, then any 0+ chars other than a linebreak char that is not the starting char of a xxx
or zzz
char sequence (the (?!xxx|zzz)
is a negative lookahead that fails the match if the substring immediately to the right matches the lookahead pattern), and then a zzz
.
These matching scenarios can be easily used with base R regmatches
(using a PCRE regex flavor that supports lookaheads):
> x <- 'xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz'
> unlist(regmatches(x, gregexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE)))
[1] "xxx aaa zzz" "xxx ccc zzz"
One note: when using a PCRE regex in base R, or the ICU regex in str_extract
/str_match
, the .
does not match linebreak characters, to enable that behavior, you need to add (?s)
at the pattern start (an inline DOTALL modifier).
Upvotes: 11
Reputation: 89097
Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.
Expressions that match your patterns are searched from left to right. Yes, all of the following strings aaaab
, aaab
, aab
, and ab
are matches to your pattern, but aaaab
being the one that starts the most to the left is the one that is returned.
So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:
str_match('xxx aaaab yyy', "a.*?y")
# [,1]
# [1,] "aaaab y"
Here all of the strings aaaab y
, aaaab yy
, aaaab yyy
matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.
So what can you do to catch that last ab
? Use this:
str_match('xxx aaaab yyy', ".*(a.*b)")
# [,1] [,2]
# [1,] "xxx aaaab" "ab"
How does it work? By adding a greedy pattern .*
in the front, you are now forcing the process to put the last possible a
into the captured group.
Upvotes: 38