Reputation: 5704
I expect the regex pattern ab{,2}c to match only an a followed by 0, 1 or 2 b's, followed by c.
It works that way in lots of languages, for instance Python. However, in R:
grepl("ab{,2}c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
# [1] TRUE TRUE TRUE TRUE FALSE
I'm surprised by the 4th TRUE. In ?regex, I can read:
{n,m}
The preceding item is matched at least n times, but not more than m times.
So I agree that {,2} should be written {0,2} to be a valid pattern (unlike in Python, where the docs state explicitly that omitting n specifies a lower bound of zero).
But then using {,2} should throw an error instead of returning misleading matches! Am I missing something or should this be reported as a bug?
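Until the engine itself rejects such patterns, a caller-side guard can raise the error the question asks for. The following is only a minimal sketch: check_bounds is a hypothetical helper (not part of base R), and its test is deliberately naive, since it also flags an escaped brace or a brace inside a character class.

```r
# Hypothetical helper: stop when a pattern contains a brace quantifier with
# no lower bound, such as {,2} or { , }.
# Naive by design: it does not distinguish escaped braces or braces inside
# character classes from real quantifiers.
check_bounds <- function(pattern) {
  if (grepl("[{][[:space:]]*,", pattern)) {
    stop("quantifier without a lower bound in pattern: ", pattern)
  }
  invisible(pattern)
}

x <- c("ac", "abc", "abbc", "abbbc", "abbbbc")

# Passes the check, so grepl() runs with the explicit lower bound.
grepl(check_bounds("ab{0,2}c"), x)
```

Calling check_bounds("ab{,2}c") stops with an error before the pattern ever reaches grepl().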
Upvotes: 12
Views: 481
Reputation: 11480
Just as an addition:
vec1 <- c('', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaa')
grep("^a{,1}$", vec1, value = TRUE) # seems to "become" ^a{1}$
grep("^a{,2}$", vec1, value = TRUE) # seems to "become" ^a{0,3}$
grep("^a{,3}$", vec1, value = TRUE) # seems to "become" ^a{0,4}$
grep("^a{,4}$", vec1, value = TRUE) # seems to "become" ^a{0,5}$
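The same kind of probing can be wrapped in a small function that reports which repetition counts a quantifier accepts. This is a rough sketch: accepted_counts and its max_n argument are hypothetical names, and it assumes R >= 3.3.0 for strrep(). It only tests runs of length 1 to max_n, so the empty string is not covered.

```r
# Hypothetical helper: return the run lengths of "a" (from 1 to max_n) that
# the anchored pattern ^a<quantifier>$ matches with the default regex engine.
accepted_counts <- function(quantifier, max_n = 8) {
  candidates <- strrep("a", seq_len(max_n))  # "a", "aa", ..., max_n times
  pattern <- paste0("^a", quantifier, "$")
  which(grepl(pattern, candidates))          # index i corresponds to length i
}

# A well-formed quantifier with an explicit lower bound:
accepted_counts("{0,2}")
# [1] 1 2
```

Passing "{,2}" instead shows what the engine does with the missing lower bound; no output is given for that call here because it depends on the regex engine and its version.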
Upvotes: 3
Reputation: 327
I am writing this as an answer because, unfortunately, I can't add a comment.
Update: Following the answer by Wiktor Stribiżew and the feedback, it seems the behavior is categorized as a bug.
Original: The syntax you are using is just not supported in R (assuming the default engine). This is why you are getting unexpected results.
In case you would like to explore differences in syntax, I would recommend taking a look at the regular-expressions.info comparison page. (You need to compare Python and R in terms of Quantifiers in this case.)
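For reference, writing the lower bound explicitly is the form that ?regex documents, and it gives the result the question expects. A minimal sketch using the question's own test vector and the default engine:

```r
x <- c("ac", "abc", "abbc", "abbbc", "abbbbc")

# Explicit lower bound of 0: an a, then 0 to 2 b's, then a c.
explicit <- grepl("ab{0,2}c", x)
print(explicit)
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```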
Upvotes: 0
Reputation: 626758
The behavior with {,2} is not expected, it is a bug. If you have a look at the TRE source code, in the tre_parse_bound function, you will see that the min variable is set to -1 before the engine tries to initialize the minimum bound. It seems that when the minimum value is missing from the quantifier, the number of "repeats" is the maximum value + 1 (as if the repeat count equals max - min = max - (-1) = max + 1).
So, a{,} matches one occurrence of a. Same as a{, } or a{ , }. See R demo, only abc is matched with ab{,}c:
grepl("ab{,}c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
grepl("ab{, }c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
grepl("ab{ , }c", c("ac", "abc", "abbc", "abbbc", "abbbbc"))
## => [1] FALSE TRUE FALSE FALSE FALSE
Upvotes: 10