Reputation: 259

Using a regular expression to find URLs that do not contain a specific word in the domain part

I want a regular expression that matches URLs that do not contain a specific word in their domain name. It does not matter whether the word appears in the query string or in subdirectories of the domain. It also doesn't matter how the URL starts, for example with http, ftp, https, or none of them. I found the expression ^((?!foo).)*$ but I don't know how to change it to fit these conditions. These are accepted URLs for the word "foo":

whatever.whatever.whatever/foo/pic
whatever.whatever.whatever?sdfd="foo"

and these are not accepted:

whatever.whateverfoo.whatever
whatever.foowhatever.whatever
whatever.foo.whatever.whatever
whatever.whatever.foo.whatever

Upvotes: 0

Answers (3)

Rui

Reputation: 4886

Here's a regex that will match the cases that you want to reject (with whatever standing in for the word):

(?:.+://){0,1}(?<subdomain>[^.]+\.){0,1}(?<domain>[^.]*whatever[^.]*\.)(?<top>[^.]+).*

(?: ) is a non-capturing group

(?<groupName> ) is a named group (useful for testing, in regexhero you can see what is being captured by the group)

{0,1} means 0 or 1

. means any character except new line

[^.] means any character except "."

* means 0 or more

+ means 1 or more, for example, .+ means 1 or many "any characters"

\. escapes the special character .
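
For illustration, here is a minimal sketch of how the named groups could be read from Java. The class name NamedGroupDemo, the helper domainLabel and the sample URLs are assumptions made for this example; the regex is the one from this answer, unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    // The regex from the answer, unchanged; "whatever" is the word it looks for.
    static final Pattern REJECT = Pattern.compile(
        "(?:.+://){0,1}(?<subdomain>[^.]+\\.){0,1}"
        + "(?<domain>[^.]*whatever[^.]*\\.)(?<top>[^.]+).*");

    // Hypothetical helper: returns the text captured by the "domain" group,
    // or null if the whole URL does not match the pattern.
    static String domainLabel(String url) {
        Matcher m = REJECT.matcher(url);
        return m.matches() ? m.group("domain") : null;
    }

    public static void main(String[] args) {
        // The word sits in the domain label, so the pattern matches.
        System.out.println(domainLabel("http://www.mywhatever.com/page"));
        // The word sits only in the path, so there is no match here.
        System.out.println(domainLabel("www.example.com/whatever"));
    }
}
```

Because [^.] also matches / and ?, the pattern may still match some URLs whose path contains the word followed by a dot, so treat this as a demonstration of the syntax rather than a complete filter.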

See http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet

you can try it here: http://regexhero.net/tester/

Upvotes: 0

barfuin

Reputation: 17494

Try this (explanation):

^(?:(?!foo).)*?[\/\?]

What this means is basically:

match anything not containing foo
until a slash or question mark is encountered

The precise syntax may vary depending on your programming language/editor. The explanation link shows the PHP example. The regex elements I've used are pretty common, so it should work for you. If not, let me know.

This regex can only be matched against a single URL at a time. So if you are trying this in regex101, don't enter all URLs at once.

Update: Example in Java (now using turner instead of foo):

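import java.util.regex.Pattern;

// Accept a URL only if "turner" does not occur before the first '/' or '?'.
Pattern p = Pattern.compile("^(?:(?!turner).)*?[\\/\\?].*");
// "turner" is in the host part, so this URL is rejected.
System.out.println(p.matcher(
    "i.cdn.turner.com/cnn/.e/img/3.0/1px.gif").matches());
// "turner" appears only in the query string, so this URL is accepted.
System.out.println(p.matcher(
    "www.facebook.com/plugins/like.php?href=http%3A%2F%2F"
    + "www.facebook.com%2Fturnerkjljl").matches());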

Output:

false
true
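
One thing to watch with this pattern: if the URL starts with a scheme such as http://, the first slash belongs to the scheme, so the check stops before the host is reached. It also needs a slash or question mark to be present at all. A possible variant is sketched below, assuming the host ends at the first /, ? or #. The class name DomainWordFilter and the helpers acceptWithout and isAccepted are made up for this example, and it ignores user info and ports.

```java
import java.util.regex.Pattern;

public class DomainWordFilter {
    // Hypothetical helper: builds a pattern that accepts a URL only if `word`
    // does not occur between an optional scheme and the first '/', '?' or '#'.
    static Pattern acceptWithout(String word) {
        String w = Pattern.quote(word);
        return Pattern.compile(
            // Optional scheme; possessive so a matched scheme is never given back.
            "^(?:[A-Za-z][A-Za-z0-9+.-]*://)?+"
            // Host characters, none of which may start the forbidden word.
            + "(?:(?!" + w + ")[^/?#])*+"
            // Optional path, query or fragment, where the word is allowed.
            + "(?:[/?#].*)?$");
    }

    static boolean isAccepted(String url, String word) {
        return acceptWithout(word).matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("http://i.cdn.turner.com/cnn/1px.gif", "turner"));
        System.out.println(isAccepted("https://www.facebook.com/like.php?href=turner", "turner"));
        System.out.println(isAccepted("whatever.whatever.whatever/foo/pic", "foo"));
        System.out.println(isAccepted("whatever.foowhatever.whatever", "foo"));
    }
}
```

The possessive quantifiers are a design choice here: without them the engine could backtrack, skip the scheme, and treat the slash in http:// as the end of the host.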

Upvotes: 1

Braj Kishore

Reputation: 351

Here is your regex in Java:

"^[^/?]+(?<!foo)"

Explanation: starting from the beginning, [^/?]+ consumes characters that are neither / nor ?. The negative lookbehind (?<!foo) then checks that the text immediately before the position where the match stops is not foo; if it is foo, that position is rejected. This is Java syntax, and the regex will vary from language to language.

With grep (Unix or a shell script), you have to negate the match of the following regex, for example with grep -v -E:

"^[^/?]+foo"

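The same "negate the match" idea can also be tried from Java, as in the rough sketch below. The class name NegatedHostMatch and the helper accepted are assumptions made for this example. It uses the grep regex as written, so it assumes URLs without a scheme, and because + needs at least one character before foo, a URL that begins with foo would not be caught.

```java
import java.util.regex.Pattern;

public class NegatedHostMatch {
    // The grep-style regex from the answer: "foo" somewhere before the first '/' or '?'.
    static final Pattern FOO_IN_HOST = Pattern.compile("^[^/?]+foo");

    // Hypothetical helper: a URL is accepted when the pattern is NOT found.
    static boolean accepted(String url) {
        return !FOO_IN_HOST.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "whatever.whatever.whatever/foo/pic",
            "whatever.whatever.whatever?sdfd=\"foo\"",
            "whatever.whateverfoo.whatever",
            "whatever.foo.whatever.whatever"
        };
        for (String url : urls) {
            System.out.println(accepted(url) + "  " + url);
        }
    }
}
```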
Upvotes: 0
