Cinnamon
Cinnamon

Reputation: 1698

Match text only if it contains all words from whitelist, but none from blacklist

Let's say we have this whitelist: one two three. And this blacklist: four five. Then:

Is there a regex solution to this problem?

Upvotes: 3

Views: 2821

Answers (2)

tchrist
tchrist

Reputation: 80423

EDIT

Mine is not a good (or in fact, after looking at it more closely, any) solution to the problem, which requires too much hassle for one regex. Counting and set removal just isn’t what you want to build into a single regex. Use auxiliarly logic. See Tim’s solution.


Using Perl, the tool of choice :) for this sort of work:

@blacklist = qw(four five);
$blacklist = do { local $" = "|"; qr/@blacklist/ };

@whitelist = qw(one two three);
$whitelist = do { local $" = "|"; qr/@whitelist/ };

if ($string =~ /\b$whitelist\b/ && $string !~ /\b$blacklist\b/) { ... } 

Perl will compile those into trie data-structures, and so will execute them extremely quickly no matter how many alternatives you have. This O(1) optimization is a really big deal in a lot of real-world applications.


EDIT

I misread the problem as wanting some whitelist and no blacklist, not as wanting all whitelist and no blacklist. That’s not really what I would call a whitelist. Maybe a want‐list and a don’t‐want—list, perhaps.

Anyway, here’s an update that works with the all‐and‐none sense, although of course for performance you reverse the ordering and make it none‐and‐all. (This assumes words do not overlap; I can do the overlapping case, but this is easier to understand without it.)

#!/usr/bin/env perl   
use strict;
use warnings;

my @blacklist = qw(four five);
my $blacklist = do { local $" = "|"; qr/@blacklist/ };

my @whitelist = qw(one two three);
my $whitelist = do { local $" = "|"; qr/@whitelist/ };

while (<DATA>) {
    s/\s*#\s*(.*)$// && print "This $1\n\t";
    if (/$blacklist/ == 0 && @whitelist == (() = /\b$whitelist\b/g)) { 
        print "GOOD: ";
    } else {
        print "EVIL: ";
    } 
    print;
}     
__END__
three one two # is a matching text (contains all whitelist words)
one three two six # is a matching text (contains all whitelist words)
two one # is not a matching text (lacks a whitelist word three)
one four two three # is not a matching text (contains a  blacklist word four)

When run it reports:

This is a matching text (contains all whitelist words)
        GOOD: three one two
This is a matching text (contains all whitelist words)
        GOOD: one three two six
This is not a matching text (lacks a whitelist word three)
        EVIL: two one
This is not a matching text (contains a blacklist word four)
        EVIL: one four two three

To debug the regex, including to see how the two trie data-structures compile and execute, simply include -Mre=debug on the command line or use re "debug"; in the code. Here is what that produces:

% perl -Mre=debug /tmp/trie
Compiling REx "\s*#\s*(.*)$"
synthetic stclass "ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}]".
Final program:
   1: STAR (3)
   2:   SPACE (0)
   3: EXACT <#> (5)
   5: STAR (7)
   6:   SPACE (0)
   7: OPEN1 (9)
   9:   STAR (11)
  10:     REG_ANY (0)
  11: CLOSE1 (13)
  13: EOL (14)
  14: END (0)
floating "#" at 0..2147483647 (checking floating) stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] minlen 1 
Compiling REx "four|five"
Final program:
   1: EXACT <f> (3)
   3: TRIE-EXACT[io] (7)
      <our> 
      <ive> 
   7: END (0)
anchored "f" at 0 (checking anchored) minlen 4 
Compiling REx "one|two|three"
Final program:
   1: TRIEC-EXACT[ot] (11)
      <one> 
      <two> 
      <three> 
  11: END (0)
stclass AHOCORASICKC-EXACT[ot] minlen 3 
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "three one two # is a matching text (contains all whitelist w"...
Found floating substr "#" at offset 14...
start_shift: 0 check_at: 14 s: 0 endpos: 15
By STCLASS: moving 0 --> 5
Guessed: match at offset 5
Matching REx "\s*#\s*(.*)$" against " one two # is a matching text (contains all whitelist words)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one two # is a matching text (contains all whitelist words)"... (61 bytes)
   5 <three> < one two #>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <e one> < two # is >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  13 <e two> < # is a ma>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
  14 < two > <# is a mat>    |  3:  EXACT <#>(5)
  15 <two #> < is a matc>    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  16 <wo # > <is a match>    |  7:    OPEN1(9)
  16 <wo # > <is a match>    |  9:    STAR(11)
                                      REG_ANY can match 49 times out of 2147483647...
  65 <list words)> <%n>      | 11:      CLOSE1(13)
  65 <list words)> <%n>      | 13:      EOL(14)
  65 <list words)> <%n>      | 14:      END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "three one two%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Compiling REx "\b(?^:one|two|three)\b"
Final program:
   1: BOUND (2)
   2: TRIEC-EXACT[ot] (13)
      <one> 
      <two> 
      <three> 
  13: BOUND (14)
  14: END (0)
stclass BOUND minlen 3 
Matching REx "\b(?^:one|two|three)\b" against "three one two%n"
Matching stclass BOUND against "three one tw" (12 bytes)
   0 <> <three one >         |  1:BOUND(2)
   0 <> <three one >         |  2:TRIEC-EXACT[ot](13)
   0 <> <three one >         |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
   1 <t> <hree one t>        |    State:    5 Accepted: N Charid:  6 CP:  68 After State:    8
   2 <th> <ree one tw>       |    State:    8 Accepted: N Charid:  7 CP:  72 After State:    9
   3 <thr> <ee one two>      |    State:    9 Accepted: N Charid:  3 CP:  65 After State:    a
   4 <thre> <e one two>      |    State:    a Accepted: N Charid:  3 CP:  65 After State:    b
   5 <three> < one two%n>    |    State:    b Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #3, continuing
                                  only one match left, short-circuiting: #3 <three>
   5 <three> < one two%n>    | 13:BOUND(14)
   5 <three> < one two%n>    | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one two%n"
Matching stclass BOUND against " one tw" (7 bytes)
   5 <three> < one two%n>    |  1:BOUND(2)
   5 <three> < one two%n>    |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
   6 <hree > <one two%n>     |  1:BOUND(2)
   6 <hree > <one two%n>     |  2:TRIEC-EXACT[ot](13)
   6 <hree > <one two%n>     |    State:    1 Accepted: N Charid:  1 CP:  6f After State:    2
   7 <ree o> <ne two%n>      |    State:    2 Accepted: N Charid:  2 CP:  6e After State:    3
   8 <ree on> <e two%n>      |    State:    3 Accepted: N Charid:  3 CP:  65 After State:    4
   9 <ree one> < two%n>      |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <one>
   9 <ree one> < two%n>      | 13:BOUND(14)
   9 <ree one> < two%n>      | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two%n"
Matching stclass BOUND against " tw" (3 bytes)
   9 <ree one> < two%n>      |  1:BOUND(2)
   9 <ree one> < two%n>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
  10 <ree one > <two%n>      |  1:BOUND(2)
  10 <ree one > <two%n>      |  2:TRIEC-EXACT[ot](13)
  10 <ree one > <two%n>      |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
  11 <ree one t> <wo%n>      |    State:    5 Accepted: N Charid:  5 CP:  77 After State:    6
  12 <ree one tw> <o%n>      |    State:    6 Accepted: N Charid:  1 CP:  6f After State:    7
  13 <ree one two> <%n>      |    State:    7 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #2, continuing
                                  only one match left, short-circuiting: #2 <two>
  13 <ree one two> <%n>      | 13:BOUND(14)
  13 <ree one two> <%n>      | 14:END(0)
Match successful!
    GOOD: three one two
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one three two six # is a matching text (contains all whiteli"...
Found floating substr "#" at offset 18...
start_shift: 0 check_at: 18 s: 0 endpos: 19
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " three two six # is a matching text (contains all whitelist "...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " three two six # is a matching text (contains all whitelist "... (67 bytes)
   3 <one> < three two>      |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <three> < two six #>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  13 <e two> < six # is >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  17 <o six> < # is a ma>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
  18 < six > <# is a mat>    |  3:  EXACT <#>(5)
  19 <six #> < is a matc>    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  20 <ix # > <is a match>    |  7:    OPEN1(9)
  20 <ix # > <is a match>    |  9:    STAR(11)
                                      REG_ANY can match 49 times out of 2147483647...
  69 <list words)> <%n>      | 11:      CLOSE1(13)
  69 <list words)> <%n>      | 13:      EOL(14)
  69 <list words)> <%n>      | 14:      END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "one three two six%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "one three two six%n"
Matching stclass BOUND against "one three two si" (16 bytes)
   0 <> <one three >         |  1:BOUND(2)
   0 <> <one three >         |  2:TRIEC-EXACT[ot](13)
   0 <> <one three >         |    State:    1 Accepted: N Charid:  1 CP:  6f After State:    2
   1 <o> <ne three t>        |    State:    2 Accepted: N Charid:  2 CP:  6e After State:    3
   2 <on> <e three tw>       |    State:    3 Accepted: N Charid:  3 CP:  65 After State:    4
   3 <one> < three two>      |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <one>
   3 <one> < three two>      | 13:BOUND(14)
   3 <one> < three two>      | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " three two six%n"
Matching stclass BOUND against " three two si" (13 bytes)
   3 <one> < three two>      |  1:BOUND(2)
   3 <one> < three two>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
   4 <one > <three two >     |  1:BOUND(2)
   4 <one > <three two >     |  2:TRIEC-EXACT[ot](13)
   4 <one > <three two >     |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
   5 <one t> <hree two s>    |    State:    5 Accepted: N Charid:  6 CP:  68 After State:    8
   6 <ne th> <ree two si>    |    State:    8 Accepted: N Charid:  7 CP:  72 After State:    9
   7 <e thr> <ee two six>    |    State:    9 Accepted: N Charid:  3 CP:  65 After State:    a
   8 < thre> <e two six>     |    State:    a Accepted: N Charid:  3 CP:  65 After State:    b
   9 <three> < two six%n>    |    State:    b Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #3, continuing
                                  only one match left, short-circuiting: #3 <three>
   9 <three> < two six%n>    | 13:BOUND(14)
   9 <three> < two six%n>    | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two six%n"
Matching stclass BOUND against " two si" (7 bytes)
   9 <three> < two six%n>    |  1:BOUND(2)
   9 <three> < two six%n>    |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
  10 <hree > <two six%n>     |  1:BOUND(2)
  10 <hree > <two six%n>     |  2:TRIEC-EXACT[ot](13)
  10 <hree > <two six%n>     |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
  11 <ree t> <wo six%n>      |    State:    5 Accepted: N Charid:  5 CP:  77 After State:    6
  12 <ree tw> <o six%n>      |    State:    6 Accepted: N Charid:  1 CP:  6f After State:    7
  13 <ree two> < six%n>      |    State:    7 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #2, continuing
                                  only one match left, short-circuiting: #2 <two>
  13 <ree two> < six%n>      | 13:BOUND(14)
  13 <ree two> < six%n>      | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " six%n"
Matching stclass BOUND against " si" (3 bytes)
  13 <ree two> < six%n>      |  1:BOUND(2)
  13 <ree two> < six%n>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
  14 <ree two > <six%n>      |  1:BOUND(2)
  14 <ree two > <six%n>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
Contradicts stclass... [regexec_flags]
Match failed
    GOOD: one three two six
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "two one # is not a matching text (lacks a whitelist word thr"...
Found floating substr "#" at offset 8...
start_shift: 0 check_at: 8 s: 0 endpos: 9
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " one # is not a matching text (lacks a whitelist word three)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one # is not a matching text (lacks a whitelist word three)"... (61 bytes)
   3 <two> < one # is >      |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   7 <o one> < # is not >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
   8 < one > <# is not a>    |  3:  EXACT <#>(5)
   9 <one #> < is not a >    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  10 <ne # > <is not a m>    |  7:    OPEN1(9)
  10 <ne # > <is not a m>    |  9:    STAR(11)
                                      REG_ANY can match 53 times out of 2147483647...
  63 <word three)> <%n>      | 11:      CLOSE1(13)
  63 <word three)> <%n>      | 13:      EOL(14)
  63 <word three)> <%n>      | 14:      END(0)
Match successful!
This is not a matching text (lacks a whitelist word three)
Guessing start of match in sv for REx "four|five" against "two one%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "two one%n"
Matching stclass BOUND against "two on" (6 bytes)
   0 <> <two one%n>          |  1:BOUND(2)
   0 <> <two one%n>          |  2:TRIEC-EXACT[ot](13)
   0 <> <two one%n>          |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
   1 <t> <wo one%n>          |    State:    5 Accepted: N Charid:  5 CP:  77 After State:    6
   2 <tw> <o one%n>          |    State:    6 Accepted: N Charid:  1 CP:  6f After State:    7
   3 <two> < one%n>          |    State:    7 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #2, continuing
                                  only one match left, short-circuiting: #2 <two>
   3 <two> < one%n>          | 13:BOUND(14)
   3 <two> < one%n>          | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one%n"
Matching stclass BOUND against " on" (3 bytes)
   3 <two> < one%n>          |  1:BOUND(2)
   3 <two> < one%n>          |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
   4 <two > <one%n>          |  1:BOUND(2)
   4 <two > <one%n>          |  2:TRIEC-EXACT[ot](13)
   4 <two > <one%n>          |    State:    1 Accepted: N Charid:  1 CP:  6f After State:    2
   5 <two o> <ne%n>          |    State:    2 Accepted: N Charid:  2 CP:  6e After State:    3
   6 <two on> <e%n>          |    State:    3 Accepted: N Charid:  3 CP:  65 After State:    4
   7 <two one> <%n>          |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <one>
   7 <two one> <%n>          | 13:BOUND(14)
   7 <two one> <%n>          | 14:END(0)
Match successful!
    EVIL: two one
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one four two three # is not a matching text (contains a blac"...
Found floating substr "#" at offset 19...
start_shift: 0 check_at: 19 s: 0 endpos: 20
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " four two three # is not a matching text (contains a blackli"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " four two three # is not a matching text (contains a blackli"... (74 bytes)
   3 <one> < four two >      |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   8 < four> < two three>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  12 <r two> < three # i>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  18 <three> < # is not >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
  19 <hree > <# is not a>    |  3:  EXACT <#>(5)
  20 <ree #> < is not a >    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  21 <ee # > <is not a m>    |  7:    OPEN1(9)
  21 <ee # > <is not a m>    |  9:    STAR(11)
                                      REG_ANY can match 55 times out of 2147483647...
  76 < word four)> <%n>      | 11:      CLOSE1(13)
  76 < word four)> <%n>      | 13:      EOL(14)
  76 < word four)> <%n>      | 14:      END(0)
Match successful!
This is not a matching text (contains a blacklist word four)
Guessing start of match in sv for REx "four|five" against "one four two three%n"
Found anchored substr "f" at offset 4...
Starting position does not contradict /^/m...
Guessed: match at offset 4
Matching REx "four|five" against "four two three%n"
   4 <one > <four two t>     |  1:EXACT <f>(3)
   5 <one f> <our two th>    |  3:TRIE-EXACT[io](7)
   5 <one f> <our two th>    |    State:    2 Accepted: N Charid:  2 CP:  6f After State:    3
   6 <ne fo> <ur two thr>    |    State:    3 Accepted: N Charid:  3 CP:  75 After State:    4
   7 <e fou> <r two thre>    |    State:    4 Accepted: N Charid:  4 CP:  72 After State:    5
   8 < four> < two three>    |    State:    5 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <our>
   8 < four> < two three>    |  7:END(0)
Match successful!
    EVIL: one four two three
Freeing REx: "one|two|three"
Freeing REx: "\s*#\s*(.*)$"

I know of no other language that allows you to debug your regexes compilation and execution like this, which alone makes it worth the price of admission in my book.

Upvotes: 3

Tim Pietzcker
Tim Pietzcker

Reputation: 336378

This is not something you'd want to use a regex for. Better do it like this (example in Python):

>>> whitelist = ["one", "two", "three"]
>>> blacklist = ["four", "five"]
>>> texts = ["three two one", "one three two six", "one two", "one two three four"]
>>> for text in texts:
...     mytext = text.split()
...     if all(word in mytext for word in whitelist) and \
...        not any(word in mytext for word in blacklist):
...         print(text)
...
three two one
one three two six
>>>

You can do it, though:

^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)
  • ^ anchors the search at the start of the string.
  • (?=...) ensures that its contents can be matched from the current position
  • (?!...) ensures that its contents can't be matched from the current position
  • \bone\b matches one but not lonely.

So you get:

>>> import re
>>> r = re.compile(r"^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)")
>>> for text in texts:
...     if r.match(text):
...         print(text)
...
three two one
one three two six

Upvotes: 9

Related Questions