ℕʘʘḆḽḘ

Reputation: 19375

regex match with AND and OR operators

I am trying to write a regex pattern that matches the following condition:

(contains the word other) OR (contains both us AND car)

This code works as expected:

str_detect(c('us cars',
             'u.s. cars',
             'us and bikes',
             'other'),
           regex('other|((?=.*us)(?=.*car))',
                 ignore_case = TRUE))
[1]  TRUE FALSE FALSE  TRUE

However, if I try to include variations of us (United States) such as u.s. and u.s, the pattern no longer works.

str_detect(c('us cars',
             'u.s. cars',
             'us and bikes',
             'other'),
           regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car))',
                 ignore_case = TRUE))
[1] FALSE FALSE FALSE  TRUE

What is the issue here? Thanks!

Upvotes: 0

Views: 74

Answers (1)

Tim Biegeleisen

Reputation: 521339

Dot is a regex metacharacter and needs to be escaped if you intend for it to be a literal dot. I don't know the stringr package well, but here is how you may do this using grepl:

x <- c('us cars', 'u.s. cars', 'us and bikes', 'other')
matches <- grepl("\\bother\\b|((?=.*\\bu\\.?s\\.?(?=\\s|$))(?=.*\\bcars?\\b).*)", x, perl=TRUE)
matches
[1]  TRUE  TRUE FALSE  TRUE

Explanation of regex:

\\bother\\b                        match "other"
|                                  OR
(
    (?=.*\\bu\\.?s\\.?(?=\\s|$))   lookahead and assert that
                                   "us" or "u.s" or "us." or "u.s." appears
    (?=.*\\bcars?\\b)              lookahead and assert that "car" or "cars" appears
    .*                             match anything
)
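
If the single pattern is hard to read, the same condition can also be expressed as three separate tests combined with R's logical operators, which avoids chaining lookaheads for the AND. This is only a sketch of an alternative, not the approach used above: the function name matches_other_or_us_car is hypothetical, and the "car" test deliberately has no trailing word boundary so that "cars" also counts.

```r
# Hypothetical helper (not part of base R or stringr): TRUE when a string
# contains "other", or contains both a "us" variant and "car".
matches_other_or_us_car <- function(x) {
  has_other <- grepl("\\bother\\b", x, ignore.case = TRUE, perl = TRUE)
  # "us", "u.s", "us." or "u.s." followed by whitespace or end of string
  has_us <- grepl("\\bu\\.?s\\.?(?=\\s|$)", x, ignore.case = TRUE, perl = TRUE)
  # No trailing \\b, so "cars" is accepted as well as "car"
  has_car <- grepl("\\bcar", x, ignore.case = TRUE, perl = TRUE)
  has_other | (has_us & has_car)
}

x <- c('us cars', 'u.s. cars', 'us and bikes', 'other')
result <- matches_other_or_us_car(x)
print(result)
# [1]  TRUE  TRUE FALSE  TRUE
```

Each piece can then be tested and adjusted on its own, at the cost of scanning every string three times.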

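A further problem with your original pattern is that consecutive lookaheads are ANDed together, so it requires us, u.s. and u.s to all appear in the same string, which neither 'us cars' nor 'u.s. cars' satisfies. In addition, the RHS of the alternation consists only of zero-width lookaheads, so it never consumes any text. Not a complete fix, but this: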

regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car))', ignore_case=TRUE)

should become something like this:

regex('other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car).*)', ignore_case=TRUE)
                                                  ^^^ add this
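
To see the behaviour concretely, here is a minimal base R sketch. It uses grepl with perl = TRUE instead of stringr so that it runs without extra packages. It compares the pattern above, where every lookahead must succeed at once, with a variant that folds the "us" spellings into one lookahead and escapes the dot. The variable names and the grouped pattern are illustrative assumptions, not part of the original answer.

```r
x <- c('us cars', 'u.s. cars', 'us and bikes', 'other')

# Chained lookaheads are ANDed: "us", "u.s." and "u.s" must all be present,
# so only the "other" branch can succeed for these inputs.
anded <- grepl("other|((?=.*us)(?=.*u.s.)(?=.*u.s)(?=.*car).*)",
               x, ignore.case = TRUE, perl = TRUE)
print(anded)
# [1] FALSE FALSE FALSE  TRUE

# One lookahead covers the "us" spellings: the dot is escaped and optional,
# and \\b keeps "us" from matching inside a longer word such as "bus".
grouped <- grepl("other|((?=.*\\bu\\.?s\\b)(?=.*car).*)",
                 x, ignore.case = TRUE, perl = TRUE)
print(grouped)
# [1]  TRUE  TRUE FALSE  TRUE
```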

Upvotes: 1
