Reputation: 4353
So I have two strings, like mylist <- c('claim', 'cl.bi'). When I do
grep('^cl\\.*', mylist)
it returns both 1 and 2. But if I do
grep('^cl\\.', mylist)
it only returns 2. So why would the first pattern match 'claim'? What happened to the period matching?
Upvotes: 2
Views: 4933
Reputation: 70732
The * operator tells the engine to match its preceding token "zero or more" times. In the first case, the engine tries to match a literal dot "zero or more" times, which might be none at all.
Essentially, if you use the * operator, the pattern will still match if there are no instances of the dot (.).
A better visualization:
*    --→ equivalent to {0,}    --→ match preceding token (0 or more times)
\\.* --→ equivalent to \\.{0,} --→ match ., .., ..., etc. or an empty match
                                                               ↑↑↑↑↑
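As a quick check of the equivalence above, here is a minimal sketch using the mylist vector from the question. The variable names star, braces and plain are only there for illustration.

```r
mylist <- c('claim', 'cl.bi')

# '*' and '{0,}' are the same quantifier: zero dots is acceptable
star   <- grepl('^cl\\.*', mylist)
braces <- grepl('^cl\\.{0,}', mylist)

# No quantifier: exactly one literal dot must follow 'cl'
plain  <- grepl('^cl\\.', mylist)

print(star)    # TRUE TRUE
print(braces)  # TRUE TRUE
print(plain)   # FALSE TRUE
```

grepl is used here instead of grep only because a logical vector makes it easier to see which elements matched.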
Upvotes: 2
Reputation: 33970
To simplify what the others have said: as far as grep is concerned, '^cl\\.*' selects the same elements as '^cl', since the * matches 0+ occurrences of the \\.
Whereas '^cl\\.' forces it to match an actual dot. It is equivalent to '^cl\\.{1}'
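Both equivalences can be tested with identical(). This is just a sketch: the vector x is made up for the check and is not from the question.

```r
# 'x' is a hypothetical test vector with a few extra edge cases
x <- c('claim', 'cl.bi', 'cl..x', 'acl.', 'c.l')

# The starred pattern selects the same elements as plain '^cl'
identical(grep('^cl\\.*', x), grep('^cl', x))       # TRUE

# A bare '\\.' behaves like '\\.{1}'
identical(grep('^cl\\.', x), grep('^cl\\.{1}', x))  # TRUE

grep('^cl\\.*', x, value = TRUE)  # "claim" "cl.bi" "cl..x"
grep('^cl\\.',  x, value = TRUE)  # "cl.bi" "cl..x"
```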
Upvotes: 1
Reputation: 162441
"^cl\\.*"
matches "claim" because the * quantifier is defined thusly (here quoting from ?regex):
'*' The preceding item will be matched zero or more times.
"claim"
contains a beginning of line, followed by a c, followed by an l, followed by zero (in this case) or more dots, so it fulfills all the requirements for a successful match.
If you want to match only strings beginning with cl., use the one-or-more-times quantifier, +, like this:
grep('^cl\\.+', mylist, value=TRUE)
# [1] "cl.bi"
Upvotes: 3
Reputation: 51430
The quantifier * means zero or more times. Pay attention to the zero. It applies to the preceding token, which is \. in your case.
In short, the cl part matches, and the dot after it isn't required.
Here are the matched substrings for both cases:
claim
--
cl.bi
---
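If you want to see the matched substrings yourself, something like the following should work, using base R's regexpr() and regmatches(). The names with_star and no_star are just for illustration.

```r
mylist <- c('claim', 'cl.bi')

# regexpr() locates the first match in each element;
# regmatches() extracts the matched text and drops elements with no match
with_star <- regmatches(mylist, regexpr('^cl\\.*', mylist))
no_star   <- regmatches(mylist, regexpr('^cl\\.', mylist))

print(with_star)  # "cl"  "cl."
print(no_star)    # "cl."
```

With the *, "claim" still contributes a match ("cl" followed by zero dots). Without it, "claim" drops out entirely.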
Upvotes: 1