Robbert

Reputation: 464

PCRE: match certain words, but exclude matches that are part of, or preceded by, words from an exclusion list

I'm trying to use a PCRE regex to match the following list of words:

  1. Milk
  2. Egg

Out of the following strings:

milk, goatmilk, goat milk, cow milk, watch out for ( milk, eggs), egg, cornstarch
milk. goatmilk. goat milk. cow milk. watch out for ( milk, eggs). egg. cornstarch
milk goatmilk goat milk cow milk watch out for ( milk, eggs). egg cornstarch

This would be an easy exercise, but sadly it must not match any of these words:

In the above case the string should match because of the words:

But if the string does not contain any of those words, it should not match, e.g.:

sugar, wheat, goatmilk, goat milk, cornstarch

I've tried to apply these, but without any success:

The closest regex I got from the resources above was:

\b(?!(?:goatmilk|goat\smilk))(egg|milk)\b

This still matches every occurrence of the word milk and, worse, it skips the word eggs because of the word boundaries. If I remove the trailing word boundary, it also matches goatmilk.

I also considered using two regular expressions: one to match all the words and another to check those matches for excluded words. That would work perfectly if not for the space between goat and milk, because the goat part would not be in the match.

If there is no way to do this with a single regex, I'll use PHP to explode the string on spaces and walk through the array; whenever a match is found, I'll check the previous index to see whether the combination forms an excluded word, which works around the space issue. I would rather not do that, though, as I find this option quite ugly :(
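
For illustration, the token-walking fallback described above could look roughly like the sketch below. It is written in Python rather than PHP, and it splits on runs of letters instead of on spaces so that punctuation such as "milk," or "eggs)." does not get in the way; both choices are assumptions, as are the names TARGETS, EXCLUDED_PREVIOUS and contains_allergen.

```python
import re

# Hypothetical word lists, taken from the examples in the question.
TARGETS = {"milk", "egg", "eggs"}
EXCLUDED_PREVIOUS = {"goat"}  # a target right after one of these is ignored


def contains_allergen(text):
    """Return True if text contains a target word that is not excluded."""
    # Tokenize on runs of letters (assumption: plain ASCII ingredient lists).
    tokens = re.findall(r"[a-z]+", text.lower())
    for i, token in enumerate(tokens):
        if token not in TARGETS:
            # "goatmilk" is a single token and never equals "milk".
            continue
        if i > 0 and tokens[i - 1] in EXCLUDED_PREVIOUS:
            # "goat milk": the previous token excludes this match.
            continue
        return True
    return False
```

Because "goatmilk" stays a single token, only the two-word form needs the look at the previous index.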

Upvotes: 2

Views: 499

Answers (1)

Wiktor Stribiżew

Reputation: 626699

If you just need to avoid returning milk when it is part of goatmilk or goat milk, you can use a (*SKIP)(*FAIL) regex:

\bgoat\s*milk\b(*SKIP)(*FAIL)|\b(?:eggs?|milk)\b

See the regex demo

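The \bgoat\s*milk\b(*SKIP)(*FAIL) branch matches goatmilk or goat milk and then discards that match by means of the two PCRE verbs: (*SKIP) prevents the engine from retrying at any position inside the text matched so far, and (*FAIL) forces the match to fail at that point. The \b(?:eggs?|milk)\b branch returns the remaining egg, eggs and milk matches as whole words.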
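
The same discard-then-match idea can be tried outside PCRE. Below is a minimal sketch in Python, whose built-in re module does not support (*SKIP)(*FAIL). It emulates the behaviour instead: the excluded phrase is matched by a branch without a capture group, and only matches from the captured branch are kept. The function name find_allergens and the case-insensitive flag are assumptions, not part of the pattern above.

```python
import re

# First branch consumes "goatmilk" / "goat milk" without capturing anything;
# second branch captures the words we actually want.
PATTERN = re.compile(r"\bgoat\s*milk\b|\b(eggs?|milk)\b", re.IGNORECASE)


def find_allergens(text):
    """Return the wanted words found in text, skipping the excluded phrases."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]


matching = "milk, goatmilk, goat milk, cow milk, watch out for ( milk, eggs), egg, cornstarch"
not_matching = "sugar, wheat, goatmilk, goat milk, cornstarch"

print(find_allergens(matching))      # ['milk', 'milk', 'milk', 'eggs', 'egg']
print(find_allergens(not_matching))  # []
```

An empty result means the string should be treated as not matching. In PCRE itself the verbs make the capture group unnecessary, since the discarded branch never produces a match at all.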

Upvotes: 1
