Life is complex

Reputation: 15619

Print regular expression non-matches only

I'm trying to remove URLs from a list of news articles when they match any entry in a list of sources to exclude. I only want to print the URLs that do not match my regular expression, and I want each item in the list news_articles printed only once.

Upvotes: 0

Views: 131

Answers (2)

Onyambu

Reputation: 79228

You can use a list comprehension:

[i for i in news_articles if not re.search('|'.join(sources_to_exclude),i)]

Out[610]: 
['http://www.chicagotribune.com/news/nationworld/ct-south-africa-trump-tweet-20180823-story.html',
 'https://www.theatlantic.com/international/archive/2018/08/trump-rule-of-law-south-africa-farmers/568390',
 'https://www.aljazeera.com/news/2018/08/south-africa-calls-trump-misinformed-land-policy-180823060142595.html',
 'https://www.timeslive.co.za/politics/2018-08-23-trumps-administration-to-monitor-land-expropriation-in-south-africa',
 'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
 'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa']
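For reference, here is a self-contained sketch of the same list-comprehension idea. The URLs and excluded sources below are hypothetical placeholders, since the question's actual lists are not shown. It also assumes the excluded sources are plain domain strings rather than regex patterns, so each one is passed through re.escape, and it uses dict.fromkeys to drop duplicates so that each article is printed only once.

```python
import re

# Hypothetical sample data; the question's actual lists are not shown here.
news_articles = [
    "https://www.example-news.com/world/story-1",
    "https://www.excluded-site.com/politics/story-2",
    "https://www.example-news.com/world/story-1",  # duplicate entry
    "https://another-source.org/article/story-3",
]
sources_to_exclude = ["excluded-site.com", "blocked.net"]

# Assumption: sources are literal strings, so escape them ('.' should match
# only a literal dot), join them into one alternation, and compile it once.
exclude_pattern = re.compile("|".join(re.escape(s) for s in sources_to_exclude))

# dict.fromkeys drops duplicates while preserving first-seen order.
kept = [url for url in dict.fromkeys(news_articles)
        if not exclude_pattern.search(url)]

for url in kept:
    print(url)
```

One caveat with any joined pattern: if sources_to_exclude is empty, the pattern is the empty string, which matches every URL, so nothing would be kept. Guard against that case if the list can be empty.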

You can also do:

re.sub('^.*('+'|'.join(sources_to_exclude)+').*$', "", "\n".join(news_articles),flags=re.M).split()
Out[612]: 
['http://www.chicagotribune.com/news/nationworld/ct-south-africa-trump-tweet-20180823-story.html',
 'https://www.theatlantic.com/international/archive/2018/08/trump-rule-of-law-south-africa-farmers/568390',
 'https://www.aljazeera.com/news/2018/08/south-africa-calls-trump-misinformed-land-policy-180823060142595.html',
 'https://www.timeslive.co.za/politics/2018-08-23-trumps-administration-to-monitor-land-expropriation-in-south-africa',
 'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
 'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa']

Upvotes: 4

Leon

Reputation: 3036

Your code is almost complete. This should work:

for result in news_articles:
    for link in sources_to_exclude:
        # Raw string so that \/ and \s are not treated as (invalid) string escapes
        regex = r'((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
        match = re.search(regex, result, re.IGNORECASE)
        if match is not None:
            break
    else:
        print('Did not match any regex: {}'.format(result))

Python supports an else clause on for loops. The else block is executed only if the loop exits normally (that is, it was not stopped using break). Because the loop breaks as soon as any regex matches, the else block is executed (and the link printed) only if no regex matches.
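The for/else behaviour described above can be seen in a small standalone sketch. The helper name urls_without_match and the sample data are hypothetical, and the match is simplified to a literal, case-insensitive search using re.escape instead of the full URL regex above.

```python
import re

def urls_without_match(urls, sources):
    """Return the URLs that match none of the sources (hypothetical helper)."""
    unmatched = []
    for url in urls:
        for source in sources:
            # Simplified check: literal, case-insensitive search for the source.
            if re.search(re.escape(source), url, re.IGNORECASE):
                break  # a source matched, so the else block is skipped
        else:
            # Reached only when the inner loop finished without break.
            unmatched.append(url)
    return unmatched

# Hypothetical sample data.
sample_urls = [
    "https://www.example-news.com/world/story-1",
    "https://www.EXCLUDED-site.com/politics/story-2",
]
result = urls_without_match(sample_urls, ["excluded-site.com"])
print(result)
```

Note that when the list of sources is empty, the inner loop body never runs, no break occurs, and every URL ends up in the else branch.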

Upvotes: 1
