dobbs

Reputation: 1043

python re.match list of regular expressions

I have two lists: ignorelist, which is a list of regular expressions, and another list called urllist. I want any item in urllist that matches a regular expression in ignorelist to be left out of finallist.

ignorelist = ['(?:\.)amazon\.com(?:\/(?:.*))',
            '(?:\.)google\.com(?:\/(?:.*))']

urllist = ['api.amazon.com/', 'fakedomain.com/']
finallist = []

for r in ignorelist:
    r = re.compile(r)
    finallist = [x for x in urllist if not r.match(x)]

which outputs

['api.amazon.com/', 'fakedomain.com/']

I want the output to be ['fakedomain.com/'], because that is the only URL that matches none of the regular expressions in ignorelist.

Upvotes: 1

Views: 2530

Answers (2)

Ulysse BN

Reputation: 11423

You filter urllist once per regex in your ignore list and reassign finallist each time, so only the last regex is taken into account. Instead, test each URL against all of the regexes and keep it only if none of them match:

finallist = []
for url in urllist:
    # keep the URL only if no regex in ignorelist matches it
    if not any(re.search(r, url) for r in ignorelist):
        finallist.append(url)

or using a list comprehension:

finallist = [url for url in urllist if not any(re.search(r, url) for r in ignorelist)]


Upvotes: 1

Jean-François Fabre

Reputation: 140307

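Several issues here:

  • re.match only matches at the start of the string. Your expressions are not built for that (they begin with a literal dot). Use re.search.
  • You're reassigning finallist on every pass through the loop, so each regex overwrites the result of the previous one: wrong logic.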
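
To illustrate the first point, here is a minimal sketch that reuses one pattern and one URL from the question. The variable names match_result and search_result are only for illustration:

```python
import re

pattern = r'(?:\.)amazon\.com(?:\/(?:.*))'
url = 'api.amazon.com/'

# re.match only tries the pattern at index 0; the URL starts with 'a', not '.'
match_result = re.match(pattern, url)

# re.search scans the whole string and finds the pattern starting at index 3
search_result = re.search(pattern, url)

print(match_result)            # None
print(search_result.group(0))  # .amazon.com/
```

This is why the original loop never filtered out 'api.amazon.com/', regardless of the reassignment problem.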

I would do:

import re

# raw strings avoid invalid-escape warnings for sequences like \.
ignorelist = [r'(?:\.)amazon\.com(?:\/(?:.*))',
              r'(?:\.)google\.com(?:\/(?:.*))']

urllist = ['api.amazon.com/', 'fakedomain.com/']


finallist = [x for x in urllist if not any(re.search(r,x) for r in ignorelist)]

So finallist contains only the URLs that match none of the regexes in ignorelist.

result:

['fakedomain.com/']

Note that I didn't compile the regexes, but you may gain some speed by doing so when testing a lot of domains.
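
A possible sketch of the compiled variant, assuming the same ignorelist and urllist as in the question (the name compiled is just illustrative). The re module also caches recently used patterns internally, so the gain may be modest; measure before relying on it:

```python
import re

ignorelist = [r'(?:\.)amazon\.com(?:\/(?:.*))',
              r'(?:\.)google\.com(?:\/(?:.*))']
urllist = ['api.amazon.com/', 'fakedomain.com/']

# compile each pattern once, before the filtering pass
compiled = [re.compile(r) for r in ignorelist]

# keep only the URLs that none of the compiled patterns match
finallist = [x for x in urllist if not any(c.search(x) for c in compiled)]

print(finallist)  # ['fakedomain.com/']
```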

Upvotes: 3
