Reputation: 1118
I'm trying to use regular expressions in R with the stringr library. I was studying the str_match and str_replace functions, and I don't understand why they give different results when I use parentheses for grouping:
library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
a<-str_match("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",perl(s))
b<-str_replace("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",perl(s), "\\2")
a[3]
#[1] " PIAZZALE "
b
#[1] " SS"
Upvotes: 1
Views: 240
Reputation: 1374
Try using just the expression s instead of perl(s):
library(stringr)
s<-"(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
a<-str_match("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",s)
b<-str_replace("MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838",s, "\\2")
a[3]
#[1] " PIAZZALE "
b
#[1] " PIAZZALE "
I've had a look in the documentation for this library: http://cran.r-project.org/web/packages/stringr/stringr.pdf
It suggests that while str_replace accepts POSIX patterns by default and also Perl patterns if supplied, str_match accepts only POSIX-style patterns and will treat a Perl pattern as a POSIX one. The two calls returned different values because they were using different regular expression engines. str_detect can use Perl expressions and returns either TRUE or FALSE. Could you use str_detect instead of str_match?
The POSIX engine does not recognise lazy (non-greedy) quantifiers.
Your expression
(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})
would be seen as the perl equivalent of
(.+)( PIAZZALE | SS)(.+)([0-9]{5})
where the first quantified group .+ matches as much as it can (the full string) before backtracking and evaluating the rest of the expression. It succeeds once that first .+ has backtracked from the end of the string far enough to consume only the characters MONT SS DPR, leaving " PIAZZALE " for the second capture group, a[3].
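The lazy versus greedy difference can be checked without stringr. The following is a minimal sketch using base R's sub() with perl = TRUE, so both patterns run on the same PCRE engine and only the quantifiers differ. The variable names (x, lazy, greedy, lazy_group2, greedy_group2) are illustrative, and the greedy pattern here only stands in for how a non-lazy engine would read the original expression.

```r
# The address string from the question.
x <- "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838"

# Original pattern with lazy quantifiers, and the same pattern made greedy.
lazy   <- "(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
greedy <- "(.+)( PIAZZALE | SS)(.+)([0-9]{5})"

# perl = TRUE selects PCRE, which honours the lazy `+?` quantifier.
# Replacing the whole match with "\\2" leaves only the second capture group.
lazy_group2   <- sub(lazy,   "\\2", x, perl = TRUE)
greedy_group2 <- sub(greedy, "\\2", x, perl = TRUE)

print(lazy_group2)    # " SS"
print(greedy_group2)  # " PIAZZALE "
```

The lazy pattern stops at the first alternative it can reach from the left (" SS"), while the greedy one backtracks from the right and meets " PIAZZALE " first.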
Here is a simplified explanation of how the different engines are processing your string. All of your quantifiers/alternation are directly wrapped in capture groups so the numbered quantifiers in the following examples are also your capture groups:
Perl:
Quantifier 1: "M"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MO"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MON"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " "
Quantifier 4: FAILED - MUST BACKTRACK
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " D"
Quantifier 4: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT"
Quantifier 2: " SS"
Quantifier 3: " DPR PIAZZALE CADORNA, 1A RICCIONE "
Quantifier 4: "47838"
SUCCESS
POSIX:
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 4783"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 478"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47"
Quantifier 2: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT SS DPR P"
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR "
Quantifier 2: FAILED - MUST BACKTRACK
Quantifier 1: "MONT SS DPR"
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE 47838"
Quantifier 4: FAILED - MUST BACKTRACK
...
Quantifier 1: "MONT SS DPR"
Quantifier 2: " PIAZZALE "
Quantifier 3: "CADORNA, 1A RICCIONE "
Quantifier 4: "47838"
SUCCESS
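The final state of each trace can be reproduced in base R. This is a sketch under the assumption that a greedy pattern run through PCRE is a fair stand-in for the non-lazy reading described above; capture_groups is a hypothetical helper written for this illustration, not part of stringr or base R.

```r
# The address string from the question.
x <- "MONT SS DPR PIAZZALE CADORNA, 1A RICCIONE 47838"

lazy   <- "(.+?)( PIAZZALE | SS)(.+?)([0-9]{5})"
greedy <- "(.+)( PIAZZALE | SS)(.+)([0-9]{5})"

# Hypothetical helper: return the capture groups of the first match.
# regexec() reports the full match followed by each group; regmatches()
# extracts the corresponding substrings.
capture_groups <- function(pattern, text) {
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  m[-1]  # drop the full match, keep groups 1..4
}

lazy_groups   <- capture_groups(lazy, x)
greedy_groups <- capture_groups(greedy, x)

print(lazy_groups)
print(greedy_groups)
```

With the lazy pattern the four groups should be "MONT", " SS", " DPR PIAZZALE CADORNA, 1A RICCIONE " and "47838"; with the greedy pattern they should be "MONT SS DPR", " PIAZZALE ", "CADORNA, 1A RICCIONE " and "47838".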
Upvotes: 1