Print regular expression non-matches only

Question

I'm trying to remove specific URLs that match a list of sources to exclude from the list of news articles. I only want to print the URLs that do not match my regular expression. I also only want to print the items in the list news_articles once.

How do I print only the URLs that do not match?

How do I print the non-matching URLs only once?

import re

sources_to_exclude = ['cnn.com','france24.com','reuters.com']

news_articles = ['http://www.chicagotribune.com/news/nationworld/ct-south-africa-trump-tweet-20180823-story.html',
         'https://www.theatlantic.com/international/archive/2018/08/trump-rule-of-law-south-africa-farmers/568390',
         'https://www.aljazeera.com/news/2018/08/south-africa-calls-trump-misinformed-land-policy-180823060142595.html',
         'https://www.timeslive.co.za/politics/2018-08-23-trumps-administration-to-monitor-land-expropriation-in-south-africa',
         'https://mg.co.za/article/2018-08-23-south-african-politicians-resist-trumps-falsehoods-about-south-africa',
         'https://www.cnn.com/2018/08/22/africa/south-africa-racist-rant-video/index.html',
         'https://www.reuters.com/article/us-safrica-usa-presidency/south-africa-to-seek-clarification-from-us-embassy-on-trumps-land-reform-tweet-sabc-idUSKCN1L80JI',
         'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa',
         'https://www.france24.com/en/20180823-south-africa-recall-mostert-second-argentina-test']

for result in news_articles:
  for link in sources_to_exclude:
    regex = '((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(link)
    match = re.search(r'{}'.format(regex), result, re.IGNORECASE)
    if match:
      print ('Matched regex:  {}'.format(result))
    else:
      # I only want to print items that DID NOT match the regex pattern
      # I also want to print these items once.
      print('Did not matched regex:  {}'.format(result))

Leon · Accepted Answer

Your code is almost complete. This should work:

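for result in news_articles:
    for link in sources_to_exclude:
        # Raw string avoids invalid-escape warnings; re.escape keeps the dots in the domain literal.
        regex = r'((http[s]?|ftp):\/)?\/?([^:\/\s]+)?({})\/([^\/]+)'.format(re.escape(link))
        if re.search(regex, result, re.IGNORECASE):
            break  # an excluded source matched; the else clause below is skipped
    else:
        # Reached only when no excluded source matched this URL.
        print('Did not match any regex: {}'.format(result))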

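Python supports an else clause on for loops. The else block runs only if the loop exits normally, that is, without being stopped by break. Because the inner loop breaks as soon as any regex matches, the else block runs (and the URL is printed) only when no regex matches. Since the print sits in the else clause rather than inside the inner loop, each URL is printed at most once.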
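To make the for/else behaviour easier to see in isolation, here is a minimal sketch that collects the non-matching URLs into a list instead of printing them. The function names non_matching_for_else and non_matching_any are hypothetical, and the pattern is simplified to an escaped domain search rather than the full URL regex from the question, so it would also match a domain that appears in the path. The second function shows what is probably the more common spelling of the same idea, using any().

```python
import re

def non_matching_for_else(urls, excluded_domains):
    """Return the URLs that match none of the excluded domains (for/else version)."""
    kept = []
    for url in urls:
        for domain in excluded_domains:
            # re.escape keeps the dot in 'cnn.com' literal.
            if re.search(re.escape(domain), url, re.IGNORECASE):
                break  # a match was found, so the else clause is skipped
        else:
            # Runs only when the inner loop finished without break.
            kept.append(url)
    return kept

def non_matching_any(urls, excluded_domains):
    """Same filter written with any(); assumed equivalent to the version above."""
    return [
        url for url in urls
        if not any(re.search(re.escape(d), url, re.IGNORECASE)
                   for d in excluded_domains)
    ]

# Three URLs taken from the question's news_articles list.
sample_urls = [
    'https://www.cnn.com/2018/08/22/africa/south-africa-racist-rant-video/index.html',
    'https://www.thedailybeast.com/trump-bemoans-persecuted-white-farmers-in-south-africa',
    'https://www.france24.com/en/20180823-south-africa-recall-mostert-second-argentina-test',
]
sources_to_exclude = ['cnn.com', 'france24.com', 'reuters.com']

for url in non_matching_for_else(sample_urls, sources_to_exclude):
    print('Did not match any regex: {}'.format(url))
```

Both functions visit each URL once in the outer loop, so a URL can appear in the result at most once per occurrence in the input. Which form reads better is a matter of taste; any() stops at the first match, just as break does.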
