Reputation: 15619
I'm trying to filter a list of news articles by removing any URL that matches a list of sources to exclude. I only want to print the URLs that do not match my regular expression, and I want each item in news_articles printed only once.
How do I print only the URLs that do not match?
How do I print the non-matching URLs only once?
import re
sources_to_exclude = ['cnn.com','france24.com','reuters.com']
news_articles = ['http://www.chicagotribune.com/news/nationworld/ct-south-africa-trump-tweet-20180823-story.html',
'https://www.theatlantic.com/international/archive/2018/08/trump-rule-of-law-south-africa-farmers/568390',
'https://www.aljazeera.com/news/2018/08/south-africa-calls-trump-misinformed-land-policy-180823060142595.html',
'https://www.timeslive.co.za/politics/2018-08-23-trumps-administration-to-monitor-land-expropriation-in-south-africa',
'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
'https://www.cnn.com/2018/08/22/africa/south-africa-racist-rant-video/index.html',
'https://www.reuters.com/article/us-safrica-usa-presidency/south-africa-to-seek-clarification-from-us-embassy-on-trumps-land-reform-tweet-sabc-idUSKCN1L80JI',
'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa',
'https://www.france24.com/en/20180823-south-africa-recall-mostert-second-argentina-test']
for result in news_articles:
    for link in sources_to_exclude:
        regex = '((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
        match = re.search(r'{}'.format(regex), result, re.IGNORECASE)
        if match:
            print ('Matched regex: {}'.format(result))
        else:
            # I only want to print items that DID NOT match the regex pattern
            # I also want to print these items once.
            print('Did not matched regex: {}'.format(result))
Upvotes: 0
Views: 131
Reputation: 79228
You can use a list comprehension:
[i for i in news_articles if not re.search('|'.join(sources_to_exclude),i)]
Out[610]:
['http://www.chicagotribune.com/news/nationworld/ct-south-africa-trump-tweet-20180823-story.html',
'https://www.theatlantic.com/international/archive/2018/08/trump-rule-of-law-south-africa-farmers/568390',
'https://www.aljazeera.com/news/2018/08/south-africa-calls-trump-misinformed-land-policy-180823060142595.html',
'https://www.timeslive.co.za/politics/2018-08-23-trumps-administration-to-monitor-land-expropriation-in-south-africa',
'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa']
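One caveat with joining the raw strings: the dots in 'cnn.com' are regex wildcards, so they match any character, not just a literal dot. A minimal sketch of the same list comprehension with each source escaped and the pattern compiled once is shown here. The names exclude_pattern and kept are introduced for illustration, and the URL list is a shortened copy of the one in the question.

```python
import re

sources_to_exclude = ['cnn.com', 'france24.com', 'reuters.com']

# Shortened copy of the question's list, for illustration only.
news_articles = [
    'https://www.cnn.com/2018/08/22/africa/south-africa-racist-rant-video/index.html',
    'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa',
    'https://www.france24.com/en/20180823-south-africa-recall-mostert-second-argentina-test',
    'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
]

# re.escape turns each '.' into a literal dot; '|' joins the alternatives.
exclude_pattern = re.compile(
    '|'.join(re.escape(source) for source in sources_to_exclude),
    re.IGNORECASE,
)

# Keep a URL only if none of the excluded sources occurs in it.
kept = [url for url in news_articles if not exclude_pattern.search(url)]

for url in kept:
    print(url)
```

Each URL is tested once against the combined pattern, so every non-matching URL is printed exactly once.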
You can also do:
re.sub('^.*('+'|'.join(sources_to_exclude)+').*$', "", "\n".join(news_articles),flags=re.M).split()
Out[612]:
['http://www.chicagotribune.com/news/nationworld/ct-south-africa-trump-tweet-20180823-story.html',
'https://www.theatlantic.com/international/archive/2018/08/trump-rule-of-law-south-africa-farmers/568390',
'https://www.aljazeera.com/news/2018/08/south-africa-calls-trump-misinformed-land-policy-180823060142595.html',
'https://www.timeslive.co.za/politics/2018-08-23-trumps-administration-to-monitor-land-expropriation-in-south-africa',
'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa']
Upvotes: 4
Reputation: 3036
Your code is almost complete. This should work:
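for result in news_articles:
    for link in sources_to_exclude:
        # Raw string avoids invalid-escape warnings for sequences like \/ and \s
        regex = r'((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
        match = re.search(regex, result, re.IGNORECASE)
        if match is not None:
            # This URL belongs to an excluded source; stop checking the others
            break
    else:
        # Runs only if the inner loop finished without break
        print('Did not match any regex: {}'.format(result))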
Python supports an else clause on for loops. The else block runs only if the loop exits normally, that is, if it was not stopped with break. Because the inner loop breaks as soon as any regex matches, the else block runs (and the URL is printed) only when no regex matches, so each non-matching URL is printed exactly once.
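To see the for/else behaviour in isolation, here is a small sketch. It is a simplification and not the code above: it uses a plain substring check instead of the regex, and the helper name is_excluded is hypothetical. It also shows that any() gives the same result.

```python
def is_excluded(url, sources):
    """Return True if any source occurs in url (substring check, no regex)."""
    for source in sources:
        if source in url:
            break          # a source matched: the else block is skipped
    else:
        return False       # loop ended without break: nothing matched
    return True


sources = ['cnn.com', 'france24.com', 'reuters.com']
urls = [
    'https://www.cnn.com/2018/08/22/africa/south-africa-racist-rant-video/index.html',
    'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
]

for url in urls:
    # any() expresses the same "did at least one source match?" question.
    same_answer = any(source in url for source in sources)
    print(url, is_excluded(url, sources), same_answer)
```

Note that the else block also runs when the list of sources is empty, since the loop then ends without ever reaching break.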
Upvotes: 1