Tyler Rinker

Reputation: 109994

regex match substring unless another substring matches

I'm trying to dig deeper into regexes and want to match a condition unless some other substring is also found in the same string. I know I can use two grepl calls (as seen below), but I want a single regex that tests for this condition, as I'm pushing my understanding. Let's say I want to match strings containing both "dog" and "man" using "(dog.*man|man.*dog)" (taken from here), but not if the string also contains the substring "park". I figured I could use (*SKIP)(*FAIL) to negate the "park", but this does not cause the string to fail (shown below).

The code:

x <- c(
    "The dog and the man play in the park.",
    "The man plays with the dog.",
    "That is the man's hat.",
    "Man I love that dog!",
    "I'm dog tired",
    "The dog park is no place for man.",
    "Park next to this dog's man."
)

# Could do this but want one regex
grepl("(dog.*man|man.*dog)", x, ignore.case=TRUE) & !grepl("park", x, ignore.case=TRUE)

# Thought this would work, it does not
grepl("park(*SKIP)(*FAIL)|(dog.*man|man.*dog)", x, ignore.case=TRUE, perl=TRUE)

Upvotes: 5

Views: 494

Answers (2)

Mariano

Reputation: 6511

stribizhev has already answered this question as it should be approached: with a negative lookahead.

I'll contribute to this particular question:

What is wrong with my understanding of (*SKIP)(*FAIL)?

(*SKIP) and (*FAIL) are regex control verbs.

  1. (*FAIL) or (*F)
    This is the easiest to understand. (*FAIL) is exactly the same as a negative lookahead with an empty subpattern: (?!). As soon as the regex engine gets to that verb in the pattern it forces an immediate backtrack.
  2. (*SKIP)
    When the regex engine first encounters this verb, nothing happens, because it only acts when it is reached on backtracking. But if there is a later failure and the engine backtracks into (*SKIP) from the right, the backtracking can't pass it. This causes:

    • A match failure.
    • The next match won't be attempted from the next character. Instead, it will start from the position in the text where the engine was when it reached (*SKIP).

    That is why these two control verbs are usually used together as (*SKIP)(*FAIL).
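
Applied to the pattern from the question, this explains the result: (*SKIP)(*FAIL) only discards the text up to and including "park". The engine then carries on from that point, and "dog ... man" can also match before "park" is ever reached. A minimal sketch in R that extracts what the second alternative actually matches (the variable names are illustrative only):

```r
x <- c(
    "The dog and the man play in the park.",
    "The dog park is no place for man.",
    "Park next to this dog's man."
)
pat <- "park(*SKIP)(*FAIL)|(dog.*man|man.*dog)"

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text
m <- regexpr(pat, x, ignore.case = TRUE, perl = TRUE)
hits <- regmatches(x, m)
print(hits)
```

In the first two strings the match starts at "dog", before "park" is reached. In the third, "Park" is skipped and the search resumes right after it, where "dog's man" still matches. So grepl() returns TRUE for all three.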

Let's consider the following example:

  • Pattern: .*park(*SKIP)(*FAIL)|.*dog
  • Subject: "That park has too many dogs"
  • Matches: " has too many dog"

Internals:

  1. First attempt.
    That park has too many dogs              ||  .*park(*SKIP)(*FAIL)|.*dog
            /\                                        /\
          (here) we have a match for park
                 the engine passes (*SKIP) -no action
                 it then encounters (*FAIL) -backtrack
                 Now it reaches (*SKIP) from the right -FAIL!
  2. Second attempt.
    Normally, it should start from the second character in the subject. However, (*SKIP) has this particular behaviour. The 2nd attempt starts:
    That park has too many dogs              ||  .*park(*SKIP)(*FAIL)|.*dog
            /\                                                       /\
          (here)
          Now, there's no match for .*park
          And of course it matches .*dog

    That park has too many dogs              ||  .*park(*SKIP)(*FAIL)|.*dog
             ^               ^                                        -----
             |    (MATCH!)   |
             +---------------+


How can I match the logic of find "dog" & "man" but not "park" with 1 regex?

Use stribizhev's solution! Try to avoid control verbs for the sake of compatibility: they're not implemented in all regex flavours. But if you're interested in these regex oddities, there's another, stronger control verb: (*COMMIT). It is similar to (*SKIP) in that it acts only when reached on backtracking, except that it causes the entire match to fail (there won't be any further attempt at all). For example:

+-----------------------------------------------+
|Pattern:                                       |
|^.*park(*COMMIT)(*FAIL)|dog                    |
+-------------------------------------+---------+
|Subject                              | Matches |
+-------------------------------------+---------+
|The dog and the man play in the park.|  FALSE  |
|Man I love that dog!                 |  TRUE   |
|I'm dog tired                        |  TRUE   |
|The dog park is no place for man.    |  FALSE  |
|park next to this dog's man.         |  FALSE  |
+-------------------------------------+---------+
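
A short R sketch that checks the table above. The vector name `subjects` and the other variable names are made up for this example:

```r
subjects <- c(
    "The dog and the man play in the park.",
    "Man I love that dog!",
    "I'm dog tired",
    "The dog park is no place for man.",
    "park next to this dog's man."
)
commit_pat <- "^.*park(*COMMIT)(*FAIL)|dog"

# If "park" is found from the start of the string, (*COMMIT) forbids any
# further attempt, so the "dog" alternative is never tried for that string
res <- grepl(commit_pat, subjects, perl = TRUE)
print(data.frame(subject = subjects, matches = res))
```

Add ignore.case = TRUE if "Park" should be treated the same as "park".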

Upvotes: 4

Wiktor Stribiżew

Reputation: 627263

You can use the anchored look-ahead solution (requiring Perl-style regexp):

grepl("^(?!.*park)(?=.*dog.*man|.*man.*dog)", x, ignore.case=TRUE, perl=TRUE)

  • ^ - anchors the pattern at the start of the string
  • (?!.*park) - fail the match if park is present
  • (?=.*dog.*man|.*man.*dog) - fail the match unless both dog and man are present (in either order).

Another version (more scalable) with 3 look-aheads:

^(?!.*park)(?=.*dog)(?=.*man)
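
To illustrate why this form scales, here is a sketch that assembles the pattern from word lists and compares it with the two-grepl approach from the question. `build_pattern` is a hypothetical helper, not part of base R, and it assumes the words contain no regex metacharacters and that both lists are non-empty:

```r
x <- c(
    "The dog and the man play in the park.",
    "The man plays with the dog.",
    "That is the man's hat.",
    "Man I love that dog!",
    "I'm dog tired",
    "The dog park is no place for man.",
    "Park next to this dog's man."
)

# Hypothetical helper: one negative look-ahead per forbidden word,
# one positive look-ahead per required word, all anchored at the start
build_pattern <- function(required, forbidden) {
    paste0(
        "^",
        paste0("(?!.*", forbidden, ")", collapse = ""),
        paste0("(?=.*", required, ")", collapse = "")
    )
}

pat <- build_pattern(required = c("dog", "man"), forbidden = "park")
single <- grepl(pat, x, ignore.case = TRUE, perl = TRUE)

# Reference result: the two-call version from the question
two_step <- grepl("(dog.*man|man.*dog)", x, ignore.case = TRUE) &
    !grepl("park", x, ignore.case = TRUE)

print(pat)
print(identical(single, two_step))
```

Adding a further required or forbidden word only means extending the corresponding vector; the alternation-based version would need every ordering of the required words spelled out.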

Upvotes: 6
