thequerist

Reputation: 1824

How to exclude one set of words but include another in QRegExp?

I am trying to write a QRegExp expression that excludes one group of words but includes another, and I am having trouble figuring it out.

Here are some of the things I tried (this example included all of the words):

(words|I|want|to|include)(?!the|ones|that|should|not|match)

So I tried this (which returned nothing):

^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$

Am I missing something?

Edit: The reason I need such an unusual regex (include/exclude) is that I want to search through a series of articles and filter the ones that contain the included words, but not if they also contain the excluded words.

So for example if article A is:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and article B is:

Vivamus fermentum semper porta.

Then a regex that includes lorem would filter article A but not B. But if ipsum is a word that I'm excluding, I do not want article A to be filtered.

I considered running one regex to filter out the articles with the words that I want and then a second regex to exclude the unwanted articles from that first set, but unfortunately the software I am using does not allow this. I can only run one regular expression.

Upvotes: 4

Views: 3336

Answers (5)

Steve Chambers

Reputation: 39424

Try this:

^(?:(?:(?!\b(?:the|ones|that|should|not|match)\b).|))*?\b(?:words|I|want|to|include)\b(?:(?:(?!\b(?:the|ones|that|should|not|match)\b).|))*$


See Debuggex Demo (with matching and non-matching examples).

Note: The above assumes QRegExp supports variable-length lookahead - I haven't verified this.

Explanation:

  1. All words must be exact (e.g. include "word" but not "sword" or "words") so are wrapped in \b either side.
  2. For the words you want to include, it only matters that at least one of them appears at least once - so that is all that is being searched for.
  3. None of the words in the exclude list may appear before or after the searched-for word, hence the need for an "exclusion group" on either side of it.
  4. Exclusion groups are implemented using a method that is explained very well in this answer.
  5. The first exclusion group uses *? to make it non-greedy, so it doesn't consume the whole text and stops as soon as the searched-for word is found.
  6. The regular expression is wrapped in ^...$ to ensure the whole string is checked/matched, not just part of it.
  7. All groups are marked as non-capturing groups by using ?: immediately after the first parenthesis.
  8. The matching should presumably be case insensitive so the regular expression should have the appropriate flag to do this (e.g. /i).
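
The points above can be checked with a small sketch. It uses Python's re module as a stand-in for QRegExp (an assumption; the two engines differ in places, so this only tests the logic), it assembles the pattern from named parts, and it leaves out the empty alternative inside the answer's exclusion groups. The helper name accepts is hypothetical.

```python
import re

# Word lists taken from the question; Python's re stands in for QRegExp here.
EXCLUDE = r"\b(?:the|ones|that|should|not|match)\b"
INCLUDE = r"\b(?:words|I|want|to|include)\b"

# One "exclusion group" step: consume a character only if no excluded
# word starts at the current position.
SAFE_CHAR = rf"(?:(?!{EXCLUDE}).)"

# Lazy exclusion group, an included word, then a greedy exclusion group.
PATTERN = re.compile(rf"^{SAFE_CHAR}*?{INCLUDE}{SAFE_CHAR}*$", re.IGNORECASE)


def accepts(text: str) -> bool:
    """Return True if text has an included word and no excluded word."""
    return PATTERN.search(text) is not None


print(accepts("theory of words"))      # "theory" is not the word "the"
print(accepts("these are the words"))  # contains the excluded word "the"
print(accepts("swords and theory"))    # "swords" is not the word "words"
```

Only the first call should report a match, which is what the \b wrapping in point 1 is meant to guarantee.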

Upvotes: 1

user557597

Reputation:

A simplified version of what you seem to need:

^(?:(?!ipsum).)*(?:lorem)(?:(?!ipsum).)*$

Formatted:

^                    # BOS
 (?:
      (?! ipsum )          # Preceding text, but not these words
      . 
 )*
 (?: lorem )          # Text wanted
 (?:
      (?! ipsum )          # Following text, but not these words
      . 
 )*
 $                    # EOS
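
As a rough check, the formatted version can be run as written in an engine with an extended (free-spacing) mode. The sketch below uses Python's re.VERBOSE for that, which is an assumption about tooling rather than anything QRegExp provides; re.IGNORECASE is also added here so that "Lorem" counts as "lorem", and article_c is a hypothetical third article.

```python
import re

FORMATTED = r"""
^                    # BOS
 (?:
      (?! ipsum )          # Preceding text, but not these words
      .
 )*
 (?: lorem )          # Text wanted
 (?:
      (?! ipsum )          # Following text, but not these words
      .
 )*
 $                    # EOS
"""

# re.VERBOSE ignores the layout whitespace and the # comments.
# re.IGNORECASE is an assumption, added so "Lorem" counts as "lorem".
pattern = re.compile(FORMATTED, re.VERBOSE | re.IGNORECASE)

article_a = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
article_b = "Vivamus fermentum semper porta."
article_c = "Lorem dolor sit amet."  # hypothetical third article

for article in (article_a, article_b, article_c):
    print(bool(pattern.search(article)), article)
```

Article A is rejected because of ipsum, article B because it lacks lorem, and only the third article is accepted.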

Upvotes: 0

ndnenkov

Reputation: 36110

You were so close. The reason

^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$

doesn't work is that it means: start with one of the words that I want to include, then continue until the end with things that are not any of the words that I don't want to include. To fix it, you can simply change the starting check to use a positive lookahead:

^(?=.*(?:words|I|want|to|include))(?:(?!the|ones|that|should|not|match).)*$

Now this means: ensure that somewhere after the beginning there is at least one of the words that I want to include, and then continue as in the original regex.

To make it even more strict, you could use word boundaries:

^(?=.*\b(?:words|I|want|to|include)\b)(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$

Note that these are all case sensitive. To change that, you can use QRegExp::setCaseSensitivity.
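
To see the difference between the three patterns side by side, here is a minimal sketch using Python's re module as a stand-in for QRegExp (an assumption; it only illustrates the logic). The names ORIGINAL, FIXED, STRICT and verdicts are hypothetical, and matching is left case sensitive as in the patterns above.

```python
import re

# The question's second attempt: the text must START with an included word.
ORIGINAL = re.compile(
    r"^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$")

# Positive lookahead: an included word may appear anywhere.
FIXED = re.compile(
    r"^(?=.*(?:words|I|want|to|include))"
    r"(?:(?!the|ones|that|should|not|match).)*$")

# Same idea with word boundaries, so substrings of longer words are ignored.
STRICT = re.compile(
    r"^(?=.*\b(?:words|I|want|to|include)\b)"
    r"(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$")


def verdicts(text: str) -> tuple:
    """Return (original, fixed, strict) match results for text."""
    return tuple(bool(p.search(text)) for p in (ORIGINAL, FIXED, STRICT))


for sample in ("I like words", "plenty of words",
               "another word to include", "words that match"):
    print(verdicts(sample), sample)
```

"plenty of words" is accepted only once the lookahead is in place, and "another word to include" is accepted only by the word-boundary version, because "another" contains the substrings "not" and "the".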

Upvotes: 2

vks

Reputation: 67978

^(?:(?!\b(?:the|ones|that|should|not|match)\b).)*\b(?:words|I|want|to|include)\b(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$

You need to apply the lookahead to both parts, before and after the words which should match. See demo.

https://regex101.com/r/bK9wF1/3

or

^(?!.*\b(?:the|ones|that|should|not|match)\b)(?=.*\b(?:words|I|want|to|include)\b).*$

Put both conditions in lookaheads. See demo.

https://regex101.com/r/uF4oY4/60
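
The two-lookahead form is easy to generate from word lists. The following is a sketch only, written with Python's re module rather than QRegExp (an assumption), and the helper build_filter is a hypothetical name; it applies the pattern to the articles from the question plus one made-up article.

```python
import re


def build_filter(include, exclude, ignore_case=True):
    """Build the two-lookahead pattern from two non-empty word lists."""
    inc = "|".join(map(re.escape, include))
    exc = "|".join(map(re.escape, exclude))
    flags = re.IGNORECASE if ignore_case else 0
    # Reject if any excluded word occurs, require some included word.
    return re.compile(rf"^(?!.*\b(?:{exc})\b)(?=.*\b(?:{inc})\b).*$", flags)


articles = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",  # article A
    "Vivamus fermentum semper porta.",                           # article B
    "Lorem dolor sit amet.",                                     # made-up article
]

article_filter = build_filter(include=["lorem"], exclude=["ipsum"])
kept = [a for a in articles if article_filter.search(a)]
print(kept)
```

Because both conditions are lookaheads anchored at the start, the order of the included and excluded words in the text does not matter.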

Upvotes: 3

Wiktor Stribiżew

Reputation: 627087

I think there is no need for a tempered greedy quantifier. Use the excluded words as alternatives inside an anchored negative look-ahead. Let me guide you through this.

You say you have Lorem ipsum dolor sit amet, consectetur adipiscing elit., and you want it to match since it contains the word lorem. The regex is \\blorem\\b (with QRegExp.CaseInsensitive set to 1), where \b forces whole-word matching. To prevent the match when the string contains the word ipsum, you need to use a negative lookahead at the very beginning of the string.

^(?!.*\\bipsum\\b).*\\blorem\\b

Now, it does not match the string in question.

To add more alternatives, we can use the alternation operator |, like this: ^(?!.*\\b(?:words|to|exclude)\\b).*\\b(?:words|to|include)\\b. Note the use of non-capturing groups: they do not store any captured text and potentially improve performance compared to capturing groups, which save the matched text in a buffer.

Thus, you get

^(?!.*\\b(?:the|ones|that|should|not|match)\\b).*\\b(?:words|I|want|to|include)\\b

See demo

Two remarks:

  1. On the demo website, single backslashes must be used; I am doubling them here for QRegExp.
  2. In Qt, . in the pattern matches any character, including a newline. On the demo website, the dot does not match newline symbols. You may want to replace it with [^\n] if you need the same behaviour, but I think it is not necessary.
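
Both remarks can be illustrated with a short sketch in Python's re module, used here as a stand-in (an assumption; it is not QRegExp). By default Python's dot does not match a newline, like the demo site, and re.DOTALL switches to the behaviour described above for Qt. The sample text is made up.

```python
import re

# In a normal string literal each backslash is doubled, as in C++ source;
# a raw string holds the same pattern text with single backslashes.
DOUBLED = "^(?!.*\\bipsum\\b).*\\blorem\\b"
RAW = r"^(?!.*\bipsum\b).*\blorem\b"
same_pattern = DOUBLED == RAW

text = "lorem dolor\nipsum sit"  # excluded word on the second line

# Default Python behaviour: the dot stops at "\n", so the lookahead
# never reaches "ipsum" and the text is accepted.
dot_stops_at_newline = re.search(RAW, text) is not None

# re.DOTALL lets the dot cross newlines, so the lookahead sees "ipsum"
# and the text is rejected.
dot_crosses_newline = re.search(RAW, text, re.DOTALL) is not None

print(same_pattern, dot_stops_at_newline, dot_crosses_newline)
```

For multi-line articles, then, whether an excluded word on a later line blocks the match depends on how the dot treats newlines.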

Upvotes: 4
