Reputation: 132
I have the keyword "cum", which our firewall uses to block adult sites. The problem is that this works a little too well, because it also blocks any URL containing the word "document".
The firewall will take regex strings, and I tried this:
^.*(?!document)cum.*$
But it still matches "document". I have a feeling I should be using a pipe | but I don't get it.
What I want is to match anywhere *cum* is found in the URL (or domain name), but NOT if the word is document or documents.
Possible? As I understand it, a word boundary doesn't work here because the word cum won't necessarily be separated by whitespace when it's in a URL, and definitely not if it's in a domain name.
Here's another way to put it:
Allow "examplesearchdocuments.com"
Allow "examplemydocuments.com"
Allow "documentexample.com"
Allow "example.com/somedocuments"
Don't allow "funnycumsiteexample.com"
Don't allow "cumallovereverythingexample.com"
Don't allow "exampleseemycum.com"
where cum is the bad word to match. Sorry if any of these examples are real sites, I don't know how else to convey this.
Upvotes: 2
Views: 3045
Reputation: 3824
Per the comments, I was wrong.
If you use a lookbehind inside your lookahead, you can match "cum" only if it is not within the word "document".
cum(?!(?<=docum)ent)
Here is some reading on lookaround http://www.regular-expressions.info/lookaround.html
Here it is against a large number of tests.
http://www.rubular.com/r/b5iZrn6Cjz
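As a quick local check, here is a minimal sketch that runs the pattern above against the URLs from the question. It assumes Python's re module behaves like the firewall's regex engine for lookarounds, which may not hold for every product; the helper name is_blocked is made up for illustration.

```python
import re

# Pattern from this answer: match "cum" unless it sits inside "document".
PATTERN = re.compile(r"cum(?!(?<=docum)ent)")

# URLs taken from the question.
ALLOW = [
    "examplesearchdocuments.com",
    "examplemydocuments.com",
    "documentexample.com",
    "example.com/somedocuments",
]
BLOCK = [
    "funnycumsiteexample.com",
    "cumallovereverythingexample.com",
    "exampleseemycum.com",
]


def is_blocked(url: str) -> bool:
    """Hypothetical helper: True if the URL contains "cum" outside "document"."""
    return PATTERN.search(url) is not None


for url in ALLOW + BLOCK:
    print(url, "->", "block" if is_blocked(url) else "allow")
```

Because the lookbehind sits inside the lookahead, both halves of "document" have to be present for the match to be suppressed, so "docum" or "cument" on their own would still be blocked.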
Upvotes: 2
Reputation: 46470
My first suggestion would also be to use \bcum\b, as the others did, but that doesn't match e.g. cumming.
You're almost right with the negative lookaround (?!) syntax:
^.*(?<!do)cum(?!ent).*$
or
^.*(?<!do)cum(?!ents?).*$
to spell out the plural explicitly (the first pattern already excludes "documents", since "ents" begins with "ent"). You can check it at http://fiddle.re/3pyj by clicking Java, for the examples you provided.
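To see why the original attempt fails and what the two-sided version changes, here is a small sketch in Python. It assumes the firewall's engine treats lookarounds the way Python's re module does; the names ORIGINAL and FIXED are only for illustration.

```python
import re

# The asker's original attempt: the lookahead is tested at the position
# just before "cum", where the next characters are "cum", never "document",
# so the negative lookahead always succeeds.
ORIGINAL = re.compile(r"^.*(?!document)cum.*$")

# Pattern from this answer: reject "cum" preceded by "do" or followed by "ent".
FIXED = re.compile(r"^.*(?<!do)cum(?!ent).*$")

print(bool(ORIGINAL.search("documentexample.com")))     # True: still blocked
print(bool(FIXED.search("documentexample.com")))        # False: allowed
print(bool(FIXED.search("example.com/somedocuments")))  # False: plural is covered too
print(bool(FIXED.search("funnycumsiteexample.com")))    # True: blocked

# Caveat: the two lookarounds are checked independently, so "cum" is also
# let through when only one side matches.
print(bool(FIXED.search("voodocum.com")))               # False: preceded by "do"
print(bool(FIXED.search("cumentertainment.com")))       # False: followed by "ent"
```

The last two lines show the trade-off: this form is simpler, but it lets through a few strings that are not actually "document".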
My suggestion would be ^.*\bcum.*$ to match a word boundary, i.e. the start of a word, then 'cum', then anything after it.
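For comparison, a short sketch of how the word-boundary version behaves on URLs, again assuming Python's re semantics carry over to the firewall. The URL "example.com/cumsite" is a made-up test case; the others come from the question.

```python
import re

# Word-boundary pattern from this answer.
WORD_START = re.compile(r"^.*\bcum.*$")

print(bool(WORD_START.search("cumallovereverythingexample.com")))  # True
print(bool(WORD_START.search("example.com/cumsite")))              # True
print(bool(WORD_START.search("documentexample.com")))              # False
# No boundary between "y" and "c", so this one is not caught:
print(bool(WORD_START.search("funnycumsiteexample.com")))          # False
```

This keeps "document" allowed, but it only catches "cum" at the start of a word, which is the limitation the question raises about domain names.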
Upvotes: 0