breezymri

Reputation: 4353

R grep to match dot

I have two strings, mylist <- c('claim', 'cl.bi'). When I run

grep('^cl\\.*', mylist)

it returns both 1 and 2. But if I do

grep('^cl\\.', mylist)

it will only return 2. So why would the first match 'claim'? What happened to the period matching?

Upvotes: 2

Views: 4933

Answers (4)

hwnd

Reputation: 70732

The * operator tells the engine to match its preceding token "zero or more" times. In the first case, the engine tries to match a literal dot "zero or more" times, which might be none at all.

Essentially, with the * operator the pattern still matches when there are no instances of the dot (.) at all.

A better visualization:

*      --→   equivalent to {0,}      --→   match preceding token (0 or more times)
\\.*   --→   equivalent to \\.{0,}   --→   match ., .., ..., etc or an empty match
                                                                       ↑↑↑↑↑
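
A quick way to check this in R, as a rough sketch (the variable names star, braces, and lens are only illustrative). It compares * with the explicit {0,} form, then reads the match.length attribute of regexpr() to see how many characters each match consumed:

```r
mylist <- c("claim", "cl.bi")

# `*` and `{0,}` are the same quantifier, so both select the same elements.
star   <- grep("^cl\\.*",    mylist)
braces <- grep("^cl\\.{0,}", mylist)
identical(star, braces)
# [1] TRUE

# Length of the matched text in each string: "cl" (zero dots) and "cl." (one dot).
lens <- attr(regexpr("^cl\\.*", mylist), "match.length")
lens
# [1] 2 3
```

The match in 'claim' is two characters long, i.e. the dot part matched zero times and the overall match still succeeded.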

Upvotes: 2

smci

Reputation: 33970

To simplify what the others have said: '^cl\\.*' is just equivalent to '^cl', since the * matches 0+ occurrences of the \\.

Whereas '^cl\\.' forces it to match an actual dot. It is equivalent to '^cl\\.{1}'
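
Both equivalences are easy to confirm on the vector from the question. This is only a sketch; same_hits is a made-up helper name, not something that ships with R:

```r
mylist <- c("claim", "cl.bi")

# Hypothetical helper: do two patterns select the same elements of x?
same_hits <- function(p1, p2, x) identical(grep(p1, x), grep(p2, x))

star_vs_bare <- same_hits("^cl\\.*", "^cl", mylist)       # TRUE
dot_vs_one   <- same_hits("^cl\\.", "^cl\\.{1}", mylist)  # TRUE
star_vs_dot  <- same_hits("^cl\\.*", "^cl\\.", mylist)    # FALSE
```

Note that "equivalent" here refers to which strings grep() selects; the text consumed by the match can still differ between '^cl\\.*' and '^cl'.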

Upvotes: 1

Josh O'Brien

Reputation: 162441

"^cl\\.*" matches "claim" because the * quantifier is defined thusly (here quoting from ?regex):

'*' The preceding item will be matched zero or more times.

"claim" contains a beginning of line, followed by a c, followed by an l, followed by zero (in this case) or more dots, so fulfilling all the requirements for a successful match.

If you only want to match strings beginning with cl., use the "one or more times" quantifier, +, like this:

grep('^cl\\.+', mylist, value=TRUE)
# [1] "cl.bi"
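
As a side note, here is a small sketch of how + compares with the unquantified '^cl\\.' from the question. It assumes an extra element, "cl..x", that is not in the original vector and is added purely for illustration:

```r
x <- c("claim", "cl.bi", "cl..x")  # "cl..x" is a made-up extra element

# Both patterns select the same strings...
plus_hits <- grep("^cl\\.+", x, value = TRUE)
one_hits  <- grep("^cl\\.",  x, value = TRUE)
plus_hits
# [1] "cl.bi" "cl..x"
identical(plus_hits, one_hits)
# [1] TRUE

# ...but `+` consumes every consecutive dot, while the bare `\\.` takes just one.
plus_text <- regmatches(x, regexpr("^cl\\.+", x))
one_text  <- regmatches(x, regexpr("^cl\\.",  x))
plus_text
# [1] "cl."  "cl.."
one_text
# [1] "cl." "cl."
```

So for filtering with grep() the two behave the same; the difference only shows up if you extract or replace the matched text.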

Upvotes: 3

Lucas Trzesniewski

Reputation: 51430

The quantifier * means zero or more times. Pay attention to the zero. It applies to the preceding token, which is \. in your case.

In short, the cl part matches, and the dot after it isn't required.

Here are the matched substrings for both cases:

claim
--

cl.bi
---
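
The same substrings can be pulled out in R with regmatches(). A minimal sketch, where matched is just an illustrative variable name:

```r
mylist <- c("claim", "cl.bi")

# Extract the text that the pattern actually consumed in each string.
matched <- regmatches(mylist, regexpr("^cl\\.*", mylist))
matched
# [1] "cl"  "cl."
```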

Upvotes: 1
