Reputation: 165
I have a problem with a regex. I have four example URLs:
http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo
http://auto.com/index.php/car-news/11654-battle-royale-2014
http://auto.com/index.php/tv-special-news/10480-new-film-4
http://auto.com/index.php/first/12234-new-volvo-xc60
I would like to exclude URLs that contain 'tv-special-news' or that end with 'photo'.
I've tried:
http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)
but it does not work exactly as I want
Upvotes: 0
Views: 579
Reputation: 387785
http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}-(?!photo)
You were close with this. You just have to remove the dash before the (?!photo) so that lines can end without a trailing dash, and add a $ at the end so that the whole line has to be matched.
You also have to change the negative lookahead into a negative lookbehind, (?<!photo), so that the end of the line does not match when it is preceded by photo.
http://(www.)?auto.com/index.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]{1,}(?<!photo)$
Also, you should escape all dots properly:
http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$
Also, the quantifier {1,} is equivalent to +.
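As a quick check, here is a minimal sketch that runs the final pattern from this answer against the four URLs from the question. It assumes Python's re module (the question does not name a language); the variable names are only illustrative.

```python
import re

# Final pattern from this answer: dots escaped, {1,} written as +,
# and a negative lookbehind placed just before the end anchor.
PATTERN = re.compile(
    r"http://(www\.)?auto\.com/index\.php/(?!(tv-special-news)).*/[a-zA-Z0-9\-]+(?<!photo)$"
)

# The four example URLs from the question.
urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

# Keep only the URLs the pattern accepts.
kept = [u for u in urls if PATTERN.match(u)]
print(kept)
```

Only the battle-royale and new-volvo URLs should survive: the first URL is rejected by the lookbehind and the third by the lookahead.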
Upvotes: 2
Reputation: 785276
You may use this regex:
^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-
(?!.*-photo$) is a negative lookahead that fails the match if the URL ends with -photo.
(?!tv-special-news) is a negative lookahead that fails the match when tv-special-news appears right after /index.php/.
Or, with a lookbehind, you can use:
^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)
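To compare the two variants, the sketch below applies both patterns to the four URLs from the question. It assumes Python's re module and uses re.search, since the first pattern is not anchored at the end; the constant names are made up for this example.

```python
import re

# Both patterns are copied from this answer.
LOOKAHEAD = re.compile(
    r"^(?!.*-photo$)http://(?:www\.)?auto\.com/index\.php/(?!tv-special-news)[^/]+/[\w-]+-"
)
LOOKBEHIND = re.compile(
    r"^http://(www\.)?auto\.com/index\.php/(?!tv-special-news).*/[a-zA-Z0-9-]+$(?<!photo)"
)

# The four example URLs from the question.
urls = [
    "http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",
    "http://auto.com/index.php/car-news/11654-battle-royale-2014",
    "http://auto.com/index.php/tv-special-news/10480-new-film-4",
    "http://auto.com/index.php/first/12234-new-volvo-xc60",
]

kept_lookahead = [u for u in urls if LOOKAHEAD.search(u)]
kept_lookbehind = [u for u in urls if LOOKBEHIND.search(u)]

# Both variants should agree on which URLs to keep.
print(kept_lookahead == kept_lookbehind)
print(kept_lookahead)
```

On these inputs both patterns keep the same two URLs. The lookahead version does the "ends with -photo" check up front, while the lookbehind version checks it once the end of the string has been reached.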
Upvotes: 1
Reputation: 97
You can simply store your links in a list and iterate over it, testing each link with a regex:
re_pattern = r'\b(?:tv-special-news|photo)\b'
re.findall(re_pattern,link)
(where link is an item from the list)
If the pattern matches, re.findall returns a non-empty list, so you only have to check whether the result is empty. If it is empty, include the link; otherwise exclude it.
Here is the sample code:
import re
links = ['http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo', 'http://auto.com/index.php/car-news/11654-battle-royale-2014', 'http://auto.com/index.php/tv-special-news/10480-new-film-4', 'http://auto.com/index.php/first/12234-new-volvo-xc60']
new_list = []
re_pattern = r'\b(?:tv-special-news|photo)\b'
for link in links:
    # findall returns an empty list when neither word occurs in the link
    result = re.findall(re_pattern, link)
    if len(result) < 1:
        new_list.append(link)
print(new_list)
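One caveat worth noting: the word-boundary pattern above matches photo anywhere in the link, not only at the end as the question asks. The sketch below illustrates the difference using a made-up URL (it is not from the question) and a hypothetical stricter pattern that anchors photo to the end of the string.

```python
import re

# Pattern from this answer: matches either word anywhere in the link.
loose = re.compile(r"\b(?:tv-special-news|photo)\b")

# Hypothetical stricter alternative: 'photo' only counts at the very end,
# and 'tv-special-news' only as a full path segment.
strict = re.compile(r"/tv-special-news/|photo$")

# Made-up URL for illustration: 'photo' appears in the middle, not at the end.
hypothetical = "http://auto.com/index.php/car-news/12345-photo-gallery-2016"

loose_hit = bool(loose.search(hypothetical))
strict_hit = bool(strict.search(hypothetical))
print(loose_hit, strict_hit)
```

With the loose pattern this link would be excluded; with the stricter one it would be kept. Which behaviour is right depends on whether photo in the middle of a URL should count.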
Upvotes: 0
Reputation: 71451
You can use this solution:
import re
list_of_urls = ["http://auto.com/index.php/car-news/12158-classicauto-cup-2016-photo",....]
new_list = [i for i in list_of_urls if not re.findall(r"photo$", i) and not re.findall(r"tv-special-news", i)]
Upvotes: 0