Reputation: 165

Python Regex - exclude url containing a word

I have a problem with a regex. Here are 4 example URLs:

http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo 
http://auto.com/index.php/car-news/11654-battle-royale-2014
http://auto.com/index.php/tv-special-news/10480-new-film-4
http://auto.com/index.php/first/12234-new-volvo-xc60

I would like to exclude URLs that contain 'tv-special-news' or end with 'photo'.

I've tried:

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

but it does not work exactly as I want.

Upvotes: 0

Answers (4)

poke

Reputation: 387785

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)

You were close with this. You just have to remove the dash before the (?!photo) to allow lines to end without a trailing dash and add a $ to the end to make sure that the whole line needs to be matched.

You will also have to change the negative lookahead into a negative lookbehind, so that the line end does not match when it is preceded by photo: (?<!photo).

http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}(?<!photo)$

Also, you should escape all dots properly:

http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$

Also, the quantifier {1,} is equivalent to +, which is why the pattern above uses + instead.
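
As a quick check, the final pattern can be run against the four URLs from the question. This is only a minimal sketch: the names `pattern`, `urls` and `kept` are illustrative, and `re.match` is used so that the pattern is anchored at the start of each string.

```python
import re

# Final pattern from above: dots escaped, lookbehind placed before the end anchor.
pattern = re.compile(
    r"http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$"
)

# The four example URLs from the question.
urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

# Keep only the URLs the pattern accepts.
kept = [url for url in urls if pattern.match(url)]
print(kept)
```

With these inputs, only the battle-royale and volvo URLs should remain in `kept`.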

Upvotes: 2

anubhava

Reputation: 785276

You may use this regex:

^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-

RegEx Demo 1

(?!.*-photo$) is a negative lookahead that fails the match if the URL ends with -photo.
(?!tv-special-news) is a negative lookahead that fails the match when tv-special-news appears right after /index.php/.
It is better to use a start anchor (^) in your regex.

Or with lookbehind regex, you can use:

^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)

RegEx Demo 2
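
To compare the two variants, here is a small sketch that applies both patterns to the URLs from the question. The names `lookahead_re`, `lookbehind_re` and `urls` are made up for this example.

```python
import re

# Variant 1: both exclusions expressed as negative lookaheads.
lookahead_re = re.compile(
    r"^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-"
)

# Variant 2: the "ends with photo" check done with a lookbehind after $.
lookbehind_re = re.compile(
    r"^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)"
)

urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

# Filter the list with each variant separately.
accepted_lookahead = [u for u in urls if lookahead_re.search(u)]
accepted_lookbehind = [u for u in urls if lookbehind_re.search(u)]

print(accepted_lookahead)
print(accepted_lookbehind)
```

Both lists should contain the same two URLs. Note that the first pattern has no end anchor, so it matches only a prefix of the URL; that is fine for a yes/no test but not for extracting the whole URL.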

Upvotes: 1

Nikhil Yadav

Reputation: 97

You can simply store your links in a list and test each one with a regex:

re_pattern = r'\b(?:tv-special-news|photo)\b'

re.findall(re_pattern,link)

(where link will be items from the list)

re.findall returns a list of all matches. You just have to check whether that list is empty: if it is empty, include the link; otherwise exclude it.

Here is the sample code:

import re

links = ['http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo', 'http://auto.com/index.php/car-news/11654-battle-royale-2014', 'http://auto.com/index.php/tv-special-news/10480-new-film-4', 'http://auto.com/index.php/first/12234-new-volvo-xc60']

new_list = []

# Keep a link only if neither excluded word occurs in it.
re_pattern = r'\b(?:tv-special-news|photo)\b'
for link in links:
    result = re.findall(re_pattern, link)
    if len(result) < 1:
        new_list.append(link)

print(new_list)
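
One thing to keep in mind: the pattern above rejects 'photo' anywhere in the link, while the question only asks to exclude it at the end. The sketch below contrasts the two behaviours. The third URL is a made-up example (it is not from the question), and `end_only_pattern` is one possible variant, not the only way to anchor the check.

```python
import re

word_pattern = r'\b(?:tv-special-news|photo)\b'   # pattern from this answer
end_only_pattern = r'tv-special-news|photo$'      # assumed variant: photo only at the end

links = [
    'http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo',
    'http://auto.com/index.php/tv-special-news/10480-new-film-4',
    # Hypothetical URL, invented for illustration: 'photo' in the middle.
    'http://auto.com/index.php/car-news/12345-photo-gallery',
]

# Links kept when 'photo' is rejected anywhere in the link.
kept_word = [link for link in links if not re.search(word_pattern, link)]

# Links kept when 'photo' is rejected only at the very end.
kept_end_only = [link for link in links if not re.search(end_only_pattern, link)]

print(kept_word)
print(kept_end_only)
```

Which behaviour is right depends on whether links with 'photo' in the middle should be kept.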

Upvotes: 0

Ajax1234

Reputation: 71451

You can use this solution:

import re

list_of_urls = ["http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",....]


new_list = [i for i in list_of_urls if not re.search(r"photo$", i.strip()) and not re.search(r"tv-special-news", i)]

Upvotes: 0
