Reputation: 19375
I am trying to write the correct regex pattern to match the following condition:
(contains the word "other") OR (contains both "us" AND "car")
This code works as expected:
str_detect(c('us cars',
'u.s. cars',
'us and bikes',
'other'),
regex('other|((?=.*us)(?=.*car))',
ignore_case = TRUE))
[1] TRUE FALSE FALSE TRUE
However, if I try to include variations of "us" (United States) such as "u.s." and "u.s", then the pattern does not work anymore.
str_detect(c('us cars',
'u.s. cars',
'us and bikes',
'other'),
regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car))',
ignore_case = TRUE))
[1] FALSE FALSE FALSE TRUE
What is the issue here? Thanks!
Upvotes: 0
Views: 74
Reputation: 521339
Dot is a regex metacharacter and needs to be escaped if you intend for it to be a literal dot. I don't know the stringr package well, but here is how you may do this using grepl:
x <- c('us cars', 'u.s. cars', 'us and bikes', 'other')
matches <- grepl("\\bother\\b|((?=.*\\bu\\.?s\\.?(?=\\s|$))(?=.*\\bcars?\\b).*)", x, perl=TRUE)
Explanation of regex:
\\bother\\b                      match "other"
|                                OR
(
    (?=.*\\bu\\.?s\\.?(?=\\s|$)) lookahead and assert that
                                 "us" or "u.s" or "us." or "u.s." appears
    (?=.*\\bcars?\\b)            lookahead and assert that "car" or "cars" appears
    .*                           match anything
)
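The claim that the first lookahead accepts all four spellings can be checked in isolation. The following is a minimal sketch in base R; the extra inputs ("bus cars", "use cars") and the variable names are illustrative assumptions added to show what the word boundary and the trailing lookahead rule out.

```r
# Only the "us" lookahead from the explanation above, tested on its own.
us_pattern <- "(?=.*\\bu\\.?s\\.?(?=\\s|$))"

# Four accepted spellings, plus two hypothetical near-misses.
variants <- c("us cars", "u.s cars", "us. cars", "u.s. cars",
              "bus cars", "use cars")

# perl = TRUE is required because lookaheads are a PCRE feature.
hits <- grepl(us_pattern, variants, perl = TRUE)

# TRUE for the four spellings; FALSE for "bus cars" (no word boundary
# before "u") and "use cars" ("us" is not followed by whitespace or end).
print(setNames(hits, variants))
```

Note that grepl reports TRUE even though the lookahead consumes no characters; a zero-width match still counts as a match.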
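As for your original pattern: consecutive lookaheads are combined with AND, so (?=.*us)(?=.*u.s.)(?=.*u.s) requires all three spellings to appear in the same string, which none of your inputs do. The variants belong in a single lookahead, with the dots escaped and made optional. The RHS of the alternation also consists only of zero-width lookaheads, so it never consumes any text. Not a complete fix, but this: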
regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car))', ignore_case=TRUE)
should become something like this:
regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car).*)', ignore_case=TRUE)
                                                  ^^^ add this
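A quick check in base R suggests why this is only a partial step. This is a sketch under stated assumptions: it uses grepl with perl = TRUE (PCRE), which may differ in small ways from the ICU engine behind stringr, and the `merged` pattern is an illustrative assumption rather than part of the original suggestion.

```r
x <- c('us cars', 'u.s. cars', 'us and bikes', 'other')

# Original right-hand side: three separate lookaheads, all required at once.
anded <- 'other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car))'

# Same pattern with .* appended: the branch now consumes text,
# but the requirements are unchanged.
anded_consuming <- 'other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car).*)'

# Hypothetical merge: one lookahead with escaped, optional dots.
# Without \\b this is loose and would also accept words such as "bus".
merged <- 'other|((?=.*u\\.?s\\.?)(?=.*car).*)'

res_anded     <- grepl(anded, x, perl = TRUE, ignore.case = TRUE)
res_consuming <- grepl(anded_consuming, x, perl = TRUE, ignore.case = TRUE)
res_merged    <- grepl(merged, x, perl = TRUE, ignore.case = TRUE)

print(res_anded)      # FALSE FALSE FALSE  TRUE
print(res_consuming)  # FALSE FALSE FALSE  TRUE
print(res_merged)     #  TRUE  TRUE FALSE  TRUE
```

On this sample data, appending .* alone leaves the result unchanged; folding the spellings into one lookahead is what makes 'us cars' and 'u.s. cars' match.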
Upvotes: 1