TheThankfulOne

Reputation: 132

RegEx to exclude match if a certain word is present, but not another partial word

I have the keyword "cum", which our firewall uses to block adult sites. The problem is that it works a little too well: it also blocks any URL containing the word "document".

The firewall will take regex strings, and I tried this:

^.*(?!document)cum.*$

But it still matches "document". I have a feeling I should be using a pipe | but I don't get it.

What I want is to match anywhere

*cum*

is found in the URL (or domain-name), but NOT if the word is document or documents.

Possible? As I understand it, a word boundary doesn't work here because the word cum won't necessarily be separated by white-space when it's in a URL, and definitely not if it's in a domain-name.

Here's another way to put it:

Allow "examplesearchdocuments.com"
Allow "examplemydocuments.com"
Allow "documentexample.com"
Allow "example.com/somedocuments"
Don't allow "funnycumsiteexample.com"
Don't allow "cumallovereverythingexample.com"
Don't allow "exampleseemycum.com"

where "cum" is the bad word to match. Sorry if any of these examples are real sites, I don't know how else to convey this.

Upvotes: 2

Views: 3045

Answers (2)

deltree

Reputation: 3824

Per the comments, I was wrong.

If you use a lookbehind inside your lookahead, you can match "cum" only if it is not within the word "document".

cum(?!(?<=docum)ent)

Here is some reading on lookaround: http://www.regular-expressions.info/lookaround.html

Here it is against a large number of tests.

http://www.rubular.com/r/b5iZrn6Cjz
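For anyone who wants to reproduce this locally, here is a minimal sketch that runs the pattern against the URLs from the question. It assumes Python's re module, which may differ in details from whatever engine the firewall uses; the names CASES and is_blocked are just illustrative.

```python
import re

# Pattern from this answer: match "cum" unless it sits inside "document".
PATTERN = re.compile(r"cum(?!(?<=docum)ent)")

# URLs taken from the question; True means the firewall should block it.
CASES = {
    "examplesearchdocuments.com": False,
    "examplemydocuments.com": False,
    "documentexample.com": False,
    "example.com/somedocuments": False,
    "funnycumsiteexample.com": True,
    "cumallovereverythingexample.com": True,
    "exampleseemycum.com": True,
}


def is_blocked(url: str) -> bool:
    """Return True if the pattern matches anywhere in the URL."""
    return PATTERN.search(url) is not None


results = {url: is_blocked(url) for url in CASES}
for url, blocked in results.items():
    print(f"{'block' if blocked else 'allow':5}  {url}")
```

With search() the pattern does not need the leading ^.* or trailing .*$; those would only matter if the firewall insists on matching the whole string.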

Upvotes: 2

TWiStErRob

Reputation: 46470

Like the others, my first suggestion would be to use \bcum\b, but that doesn't match e.g. "cumming".

You're almost right with the negative lookaround (?!) syntax:

^.*(?<!do)cum(?!ent).*$

or

^.*(?<!do)cum(?!ents?).*$

to support plural. You can check it at http://fiddle.re/3pyj by clicking Java for the examples you provided.
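A rough way to compare the two patterns is to run both over the question's URLs. The sketch below assumes Python's re module rather than the Java engine used on the fiddle, and EXTRA_URL is a made-up string that is not from the question.

```python
import re

# Both patterns as posted, matched against the whole URL.
SINGULAR = re.compile(r"^.*(?<!do)cum(?!ent).*$")
PLURAL = re.compile(r"^.*(?<!do)cum(?!ents?).*$")

# URLs from the question: the first four should be allowed, the last three blocked.
QUESTION_URLS = [
    "examplesearchdocuments.com",
    "examplemydocuments.com",
    "documentexample.com",
    "example.com/somedocuments",
    "funnycumsiteexample.com",
    "cumallovereverythingexample.com",
    "exampleseemycum.com",
]

# Hypothetical URL: "cum" is followed by "ent" but not preceded by "do".
EXTRA_URL = "cumentexample.com"


def blocked_by(pattern: re.Pattern, url: str) -> bool:
    """Return True if the pattern matches the URL from its start."""
    return pattern.match(url) is not None


singular = [blocked_by(SINGULAR, u) for u in QUESTION_URLS]
plural = [blocked_by(PLURAL, u) for u in QUESTION_URLS]
extra_blocked = blocked_by(SINGULAR, EXTRA_URL)

print(singular == plural)  # both patterns give the same verdicts here
print(extra_blocked)       # the lookahead alone lets this one through
```

Two things seem worth noting. A negative lookahead for "ent" already rejects "ents", so the optional s should not change any verdict. Also, the lookbehind and lookahead are checked independently, so a string with only "do" before or only "ent" after is let through as well, not just "document".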

My suggestion would be ^.*\bcum.*$, which matches a word boundary (i.e. the start of a word), then "cum", then anything after it.
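It may be worth checking the word-boundary version against the question's "don't allow" URLs before relying on it. A small sketch, again assuming Python's re semantics for \b (a boundary between a word character and a non-word character):

```python
import re

# Word-boundary pattern: "cum" must start a word.
BOUNDARY = re.compile(r"^.*\bcum.*$")

# The three URLs the question wants blocked.
DENY_URLS = [
    "funnycumsiteexample.com",
    "cumallovereverythingexample.com",
    "exampleseemycum.com",
]

# Keep only the URLs the boundary pattern actually catches.
caught = [u for u in DENY_URLS if BOUNDARY.match(u)]
print(caught)  # ['cumallovereverythingexample.com']
```

Under these assumptions only the URL that starts with the word is caught, because in the other two "cum" follows a letter and so there is no boundary in front of it. That fits the concern in the question that word boundaries are unreliable inside domain names.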

Upvotes: 0
