Reputation: 1824
I am trying to write a QRegExp expression that includes one group of words but excludes another group of words, and I am having trouble figuring this out.
Here are some of the things I tried (this example included all of the words):
(words|I|want|to|include)(?!the|ones|that|should|not|match)
So I tried this (which returned nothing):
^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$
Am I missing something?
Edit: The reason why I need such an unusual regex (include/exclude) is because I want to search through a series of articles and filter the ones that have the included words in them but not if they also have the excluded words in them.
So for example if article A is:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
and article B is:
Vivamus fermentum semper porta.
Then a regex that includes lorem would filter article A but not B. But if ipsum is a word that I'm excluding, I do not want article A to be filtered.
I considered doing a regex to filter out the articles with the words that I want and then run a second regex excluding articles from the first set that I do not want, but unfortunately the software I am using does not allow me to do this. I can only run one regular expression.
Upvotes: 4
Views: 3336
Reputation: 39424
Try this:
^(?:(?:(?!\b(?:the|ones|that|should|not|match)\b).|))*?\b(?:words|I|want|to|include)\b(?:(?:(?!\b(?:the|ones|that|should|not|match)\b).|))*$
See Debuggex Demo (with matching and non-matching examples).
Note: The above assumes QRegExp supports variable-length lookahead - I haven't verified this.
Explanation:
\b either side of each word group, so that only whole words match.
.*? to make it non-greedy so it doesn't consume the whole text and stops as soon as the searched-for word is found.
^...$ to ensure the whole string is checked/matched, not just part of it.
?: immediately after the first parenthesis, to make each group non-capturing.
Upvotes: 1
Reputation:
A simplified version of what you seem to need:
^(?:(?!ipsum).)*(?:lorem)(?:(?!ipsum).)*$
^ # BOS
(?:
(?! ipsum ) # Preceding text, but not these words
.
)*
(?: lorem ) # Text wanted
(?:
(?! ipsum ) # Following text, but not these words
.
)*
$ # EOS
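As a rough check of the pattern above, here is a minimal sketch using Python's re module as a stand-in for QRegExp (the two engines may differ in details). The IGNORECASE flag and the third sample text are my own assumptions, added so that "Lorem" in the question's article matches the lowercase pattern.

```python
import re

# Python's `re` stands in for QRegExp here; IGNORECASE is an added assumption
# so that "Lorem" in the sample text matches the lowercase word in the pattern.
pattern = re.compile(r"^(?:(?!ipsum).)*(?:lorem)(?:(?!ipsum).)*$", re.IGNORECASE)

article_a = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
article_b = "Vivamus fermentum semper porta."
article_c = "Lorem dolor sit amet."  # hypothetical: wanted word, no excluded word

for name, text in [("A", article_a), ("B", article_b), ("C", article_c)]:
    # A -> False (contains ipsum), B -> False (no lorem), C -> True
    print(name, bool(pattern.search(text)))
```

Article A is rejected because the tempered dot cannot step over ipsum, article B because it never reaches lorem.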
Upvotes: 0
Reputation: 36110
You were so close. The reason
^(words|I|want|to|include)(?:(?!the|ones|that|should|not|match).)*$
doesn't work because it means: start with one of the words that I want to include, then continue until the end with characters that do not begin one of the words that I don't want to include. To fix it, simply change the starting check to a positive lookahead:
^(?=.*(?:words|I|want|to|include))(?:(?!the|ones|that|should|not|match).)*$
Now this means: ensure that somewhere after the beginning there is at least one of the words that I want to include, and then continue as in the original regex.
To make it even more strict, you could use word boundaries:
^(?=.*\b(?:words|I|want|to|include)\b)(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$
Note that these are all case sensitive. To change that, you can use QRegExp::setCaseSensitivity
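To see the difference between the three variants, here is a small sketch in Python's re module (used as a stand-in, since I can't run QRegExp here). The sample sentences are made up for illustration, and re.IGNORECASE plays the role that setCaseSensitivity would play in Qt.

```python
import re

INCLUDE = r"(?:words|I|want|to|include)"
EXCLUDE = r"(?:the|ones|that|should|not|match)"

# The asker's attempt: the text must *start* with an included word.
original = re.compile(rf"^{INCLUDE}(?:(?!{EXCLUDE}).)*$")
# Positive lookahead: an included word may appear anywhere.
fixed = re.compile(rf"^(?=.*{INCLUDE})(?:(?!{EXCLUDE}).)*$")
# Word boundaries: "not" inside "another" no longer counts as excluded.
strict = re.compile(rf"^(?=.*\b{INCLUDE}\b)(?:(?!\b{EXCLUDE}\b).)*$")
# Case-insensitive variant (Python analogue of changing case sensitivity).
strict_ci = re.compile(strict.pattern, re.IGNORECASE)

samples = [  # hypothetical test sentences
    "words I like",
    "some words here",
    "some words but not these",
    "another word to keep",
    "WORDS only",
]

for text in samples:
    results = [bool(p.search(text)) for p in (original, fixed, strict, strict_ci)]
    print(f"{text!r}: {results}")
```

The fourth sample shows why the word boundaries matter: without them, the substring "not" in "another" blocks the match.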
Upvotes: 2
Reputation: 67978
^(?:(?!\b(?:the|ones|that|should|not|match)\b).)*\b(?:words|I|want|to|include)\b(?:(?!\b(?:the|ones|that|should|not|match)\b).)*$
You need to add the lookahead to both parts, before and after the words which should match. See demo.
https://regex101.com/r/bK9wF1/3
or
^(?!.*\b(?:the|ones|that|should|not|match)\b)(?=.*\b(?:words|I|want|to|include)\b).*$
Add both conditions as lookaheads. See demo.
https://regex101.com/r/uF4oY4/60
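A quick way to compare the two patterns is to run both over the same inputs. This is only a sketch in Python's re module (not QRegExp), and the sample sentences are hypothetical; it shows the two forms agreeing on these samples, not a proof that they are equivalent.

```python
import re

EXC = r"\b(?:the|ones|that|should|not|match)\b"
INC = r"\b(?:words|I|want|to|include)\b"

# First form: tempered dot on both sides of the included word.
tempered = re.compile(rf"^(?:(?!{EXC}).)*{INC}(?:(?!{EXC}).)*$")
# Second form: both conditions expressed as lookaheads at the start.
lookaheads = re.compile(rf"^(?!.*{EXC})(?=.*{INC}).*$")

samples = {  # hypothetical single-line inputs -> whether they should pass
    "words I like": True,
    "I want another one": True,
    "words that match": False,
    "nothing relevant": False,
}

for text, expected in samples.items():
    got_tempered = bool(tempered.search(text))
    got_lookaheads = bool(lookaheads.search(text))
    print(f"{text!r}: tempered={got_tempered} lookaheads={got_lookaheads}")
```

The lookahead form is usually easier to read and to extend, since each condition sits in its own group.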
Upvotes: 3
Reputation: 627087
I think there is no need for a tempered greedy quantifier. Use the excluded words as alternatives inside an anchored negative lookahead. Let me guide you through this.
Say you have Lorem ipsum dolor sit amet, consectetur adipiscing elit., and you want it to match because it contains the word lorem. The regex is \\blorem\\b (with case-insensitive matching enabled, e.g. via QRegExp::setCaseSensitivity with Qt::CaseInsensitive), where \b forces whole-word matching. To prevent the match when the string also contains the word ipsum, use a negative lookahead at the very beginning of the string:
^(?!.*\\bipsum\\b).*\\blorem\\b
Now, it does not match the string in question.
To add more alternatives, we can use the alternation operator |, like this: ^(?!.*\\b(?:words|to|exclude)\\b).*\\b(?:words|to|include)\\b. Note the use of non-capturing groups: they do not store the captured text, which can improve performance compared with capturing groups that save the matched text in a buffer.
Thus, you get
^(?!.*\\b(?:the|ones|that|should|not|match)\\b).*\\b(?:words|I|want|to|include)\\b
See demo
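If the word lists change often, the pattern can be assembled from them. Below is a minimal sketch in Python's re module rather than QRegExp; build_filter is a hypothetical helper name, and the IGNORECASE and DOTALL flags are my assumptions (DOTALL lets the dot cross newlines). In a Python raw string the backslashes are single, unlike the doubled \\b shown above.

```python
import re

def build_filter(include, exclude):
    # Hypothetical helper: builds ^(?!.*\b(?:EXCLUDE)\b).*\b(?:INCLUDE)\b
    # from two non-empty word lists. re.escape guards against words that
    # contain regex metacharacters.
    inc = "|".join(re.escape(word) for word in include)
    exc = "|".join(re.escape(word) for word in exclude)
    pattern = rf"^(?!.*\b(?:{exc})\b).*\b(?:{inc})\b"
    # DOTALL is an assumption: it makes "." match newlines as well.
    return re.compile(pattern, re.IGNORECASE | re.DOTALL)

article_filter = build_filter(["lorem"], ["ipsum"])

article_a = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
article_b = "Vivamus fermentum semper porta."
multi_ok = "Lorem dolor\nsit amet"   # hypothetical two-line article
multi_bad = "Lorem dolor\nipsum"     # excluded word on the second line

for text in (article_a, article_b, multi_ok, multi_bad):
    print(repr(text), bool(article_filter.search(text)))
```

Only the third text passes: it contains lorem and no ipsum, even across the line break.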
Two remarks:
QRegExp
. in the pattern matches any character, including a newline. On the demo web site, the dot does not match newline symbols. You may want to replace it with [^\n] if you need the same functionality, but I think it is not necessary.
Upvotes: 4