Cornwell

Reputation: 3410

Understanding why negative lookahead is not working

Let's say I have this url:

https://www.google.com/search?q=test&tbm=isch&randomParameters=123

I want to match Google's search URL only when it doesn't contain any of these:

tbm=isch

tbm=news

param1=432

I've tried this pattern:

^http(s):\/\/www.google.(.*)\/(search|webhp)\?(?![\s]+(tbm=isch|tbm=news|param1=432))

but it's not working: it still matches the sample URL.

Upvotes: 0

Views: 89

Answers (3)

Jan

Reputation: 43169

You could use:

^                         # anchor it to the beginning
https?://                 # http or https
(?:
    (?!tbm=(?:isch|news)) # first neg. lookahead
    (?!param1=432)        # second
    \S                    # anything but whitespace
)+
$                         # THE END
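
As a minimal sketch, assuming Python, the commented pattern above can be used as written by compiling it with re.VERBOSE, which ignores the layout whitespace and the # comments. The names URL_PATTERN and is_allowed are illustrative and not part of the answer.

```python
import re

# The pattern from the answer, compiled in verbose mode so the
# whitespace and the "#" comments inside it are ignored.
URL_PATTERN = re.compile(
    r"""
    ^                         # anchor it to the beginning
    https?://                 # http or https
    (?:
        (?!tbm=(?:isch|news)) # first neg. lookahead
        (?!param1=432)        # second
        \S                    # anything but whitespace
    )+
    $                         # THE END
    """,
    re.VERBOSE,
)


def is_allowed(url: str) -> bool:
    """Return True if the URL contains none of the excluded parameters."""
    return URL_PATTERN.match(url) is not None


print(is_allowed("https://www.google.com/search?q=test&tbm=isch&randomParameters=123"))  # False
print(is_allowed("https://www.google.com/search?q=test&randomParameters=123"))  # True
```

Note that this pattern checks every position for the excluded strings but does not restrict the host or path, so it accepts any http(s) URL that lacks them, not only Google search URLs.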

See a demo on regex101.com.
Your programming language may also have a built-in URL parser, such as urlparse() in Python, which would avoid the regex altogether.
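
A rough sketch of the parser-based route, assuming Python's standard urllib.parse module. The function name is_plain_google_search and the EXCLUDED set are made up for illustration, and the host and path checks are only an approximation of what the question's pattern intends.

```python
from urllib.parse import urlparse, parse_qsl

# Hypothetical set of (name, value) pairs that disqualify a URL.
EXCLUDED = {("tbm", "isch"), ("tbm", "news"), ("param1", "432")}


def is_plain_google_search(url: str) -> bool:
    """Return True for a Google search/webhp URL without excluded parameters."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    # Assumption: any host starting with "www.google." counts as Google.
    if not host.startswith("www.google."):
        return False
    if parts.path not in ("/search", "/webhp"):
        return False
    # parse_qsl yields (name, value) tuples for each query parameter.
    return not any(pair in EXCLUDED for pair in parse_qsl(parts.query))


print(is_plain_google_search("https://www.google.com/search?q=test&tbm=isch&randomParameters=123"))  # False
print(is_plain_google_search("https://www.google.com/search?q=test&randomParameters=123"))  # True
```

Because the query string is parsed into exact name/value pairs, a parameter such as param1=4321 would not be rejected here, whereas a substring-based regex would reject it.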

Upvotes: 3

Maria Ivanova

Reputation: 1146

Change the [\s]+ to .*? or [\S]*? and your regex will work. [\s]+ matches only whitespace, so the lookahead can never reach the parameters and therefore never rejects anything. To also match the whole URL when it fits the criteria, add another [\S]* at the end:

^http(s):\/\/www.google.([\w\.]*)\/(search|webhp)\?(?![\S]*?(tbm=isch|tbm=news|param1=432))[\S]*
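
To see the difference, here is a small Python check (Python is an assumption; the question does not name a language) that runs the question's pattern and the pattern above against the sample URL. The constant names are illustrative.

```python
import re

SAMPLE = "https://www.google.com/search?q=test&tbm=isch&randomParameters=123"

# Pattern from the question: [\s]+ demands whitespace right after the "?",
# so the negative lookahead always succeeds on a normal URL.
ORIGINAL = re.compile(
    r"^http(s):\/\/www.google.(.*)\/(search|webhp)\?(?![\s]+(tbm=isch|tbm=news|param1=432))"
)

# Pattern from this answer: [\S]*? lets the lookahead scan the rest of the URL.
FIXED = re.compile(
    r"^http(s):\/\/www.google.([\w\.]*)\/(search|webhp)\?(?![\S]*?(tbm=isch|tbm=news|param1=432))[\S]*"
)

print(bool(ORIGINAL.match(SAMPLE)))  # True: the lookahead never rejects anything
print(bool(FIXED.match(SAMPLE)))     # False: tbm=isch is found, so the match fails
print(bool(FIXED.match("https://www.google.com/search?q=test")))  # True
```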

Upvotes: 1

Anirudha

Reputation: 32797

Your regex should be

^https:\/\/www.google.([^\/]*)\/(search|webhp)\?(?!.*(tbm\=isch|tbm\=news|param1\=432)).*$


The issue was that your lookahead used [\s]+, which matches only whitespace, instead of .*, which matches any number of arbitrary characters and so lets the lookahead scan the rest of the URL.

Also, www.google.(.*) can cause a lot of backtracking and hurt performance, so I have replaced it with www.google.([^\/]*), which stops at the first slash.
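
Besides the backtracking cost, the two groups can also capture different text. The following Python sketch (an assumed language, with a contrived URL chosen only to make the difference visible) compares just the host part of both patterns.

```python
import re

# Host part of the question's pattern: (.*) may run across slashes.
GREEDY = re.compile(r"^https:\/\/www.google.(.*)\/(search|webhp)\?")

# Host part of this answer's pattern: [^\/]* stops at the first slash.
BOUNDED = re.compile(r"^https:\/\/www.google.([^\/]*)\/(search|webhp)\?")

# Contrived URL whose query string happens to contain "/search?" again.
url = "https://www.google.com/search?q=a/search?b"

print(GREEDY.match(url).group(1))   # com/search?q=a
print(BOUNDED.match(url).group(1))  # com
```

With (.*) the engine first consumes the whole string and then backs off one character at a time until the rest of the pattern fits, which is where the extra work comes from.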


Edit

I am wondering why you are using a regex for this instead of a simple indexOf or a similar method from the language you are using. Is there a special use case here?
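
For comparison, a plain substring check could look like the sketch below. It assumes Python, where the in operator plays the role of indexOf; EXCLUDED_PARAMS and has_excluded_param are illustrative names.

```python
# Hypothetical list of substrings that disqualify a URL.
EXCLUDED_PARAMS = ("tbm=isch", "tbm=news", "param1=432")


def has_excluded_param(url: str) -> bool:
    """Return True if any excluded substring occurs anywhere in the URL."""
    return any(param in url for param in EXCLUDED_PARAMS)


print(has_excluded_param("https://www.google.com/search?q=test&tbm=isch&randomParameters=123"))  # True
print(has_excluded_param("https://www.google.com/search?q=test&randomParameters=123"))  # False
```

Like the regex versions, this is a raw substring test, so it would also flag a value such as param1=4321.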

Upvotes: 2
