sasker

Reputation: 2321

Best way to filter out emails and domains in Python

I have a list of emails and domains that I want to use as a blacklist.

For emails it is easy, since I can simply compare the addresses directly. For domains, though, emails at subdomains etc. need to be matched as well.

So for the foo.com domain, I would need to filter out:

[email protected]
[email protected]

How is this typically done? Via Regex? Splitting the email into the appropriate strings?

Upvotes: 3

Views: 4268

Answers (3)

Blake

Reputation: 841

This would be the simplest way I can think of:

>>> f = '[email protected]'
>>> '.'.join(f.split('.')[-2:])
'bar.com'

It doesn't use a regex, it's a single readable line, it pulls out the domain name, and it has the added benefit of not caring whether the domain is a .com, a .net, or whatever.

Then you'd just check the extracted domain against your blacklist table.
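
A minimal sketch of that lookup, assuming the blacklist is a plain in-memory set (BLACKLISTED_DOMAINS and both function names are made up for illustration, not part of the answer above). It splits on '@' first, so an address with no subdomain still reduces to its bare domain. The two-label rule does not handle suffixes such as .co.uk.

```python
# Hypothetical blacklist; in practice this might come from a file or a db table.
BLACKLISTED_DOMAINS = {"foo.com", "bar.com"}


def last_two_labels(email_addr):
    """Return the last two dot-separated labels of the part after '@'."""
    host = email_addr.rsplit("@", 1)[-1].lower()
    return ".".join(host.split(".")[-2:])


def is_domain_blacklisted(email_addr):
    """True if the address's two-label domain is in the blacklist."""
    return last_two_labels(email_addr) in BLACKLISTED_DOMAINS
```

Under these assumptions, both 'user@foo.com' and 'user@mail.foo.com' reduce to 'foo.com' and are rejected, while 'user@notfoo.com' is not.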

EDIT: OK, for .co.uk domains and the like:

>>> import re
>>> def get_addr(email_addr):
...     parts = re.split(r'[@.]', email_addr)
...     # keep three labels when the second-to-last one is 'co', otherwise two
...     return '.'.join(parts[(-3 if parts[-2] == 'co' else -2):])
...
>>> get_addr('[email protected]')
'bar.com'
>>> get_addr('[email protected]')
'bar.co.uk'
>>> get_addr('[email protected]')
'bar.com'

EDIT: @Wilduck pointed out that there may be use cases where you want to filter out specific subdomains, but not others (e.g. 'community.ebay.co.uk'). I figured you may want to blacklist specific email addresses too without needing a separate table (e.g. [email protected]). Here's my solution:

>>> BLACKLIST = {'foo.com'}  # placeholder; swap in your own list or db table
>>> def is_in_blacklist(addr):
...     # check if addr is in your list or db table
...     return addr in BLACKLIST

>>> def addr_is_blacklisted(addr):
...     if not addr: return False
...     if is_in_blacklist(addr):
...         return True
...     sliced = '.'.join(addr.split('@' if '@' in addr else '.')[1:])
...     return addr_is_blacklisted(sliced)

So it strips the address one piece at a time from the left (first the local part, then each subdomain label) and checks what remains against your blacklist. Obviously you can't get an answer with a single query, but you can filter on single email addresses, subdomains, domains, and even top-level domains if you're so inclined. You'll have 3-4 queries per email on average, so it stays cheap even if you have a huge blacklist.
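
The same walk can also be written without recursion. This is only a sketch, assuming the blacklist fits in a set (BLACKLIST, candidate_keys and is_blocked are hypothetical names chosen for illustration): it generates every key that would be looked up, from the full address down to the top-level domain, and stops at the first hit.

```python
# Hypothetical entries: one full address, one subdomain, one whole domain.
BLACKLIST = {"spammer@example.org", "sub.baz.net", "foo.com"}


def candidate_keys(addr):
    """Yield addr, then its host part, then each parent domain down to the TLD."""
    addr = addr.lower()
    yield addr
    if "@" in addr:
        addr = addr.split("@", 1)[1]  # drop the local part
        yield addr
    while "." in addr:
        addr = addr.split(".", 1)[1]  # drop the leftmost label
        yield addr


def is_blocked(addr):
    """True if the address or any of its parent domains is blacklisted."""
    return any(key in BLACKLIST for key in candidate_keys(addr))
```

For 'user@mail.foo.com' the keys checked are 'user@mail.foo.com', 'mail.foo.com', 'foo.com' and 'com', which is where the 3-4 lookups per address come from.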

Upvotes: 1

Wilduck

Reputation: 14136

I'm thinking the easiest way to go about this is to use the string method endswith. This method works as follows:

>>> blacklisted = 'foo.com'
>>> email = '[email protected]'
>>> email.endswith('foo.com')
True
>>> email = '[email protected]'
>>> email.endswith('foo.com')
True

So, this will return True if the domain, or email, or whatever ends with 'foo.com'. As you can see, this includes all subdomains of 'foo.com'. Conveniently, you can also pass a tuple to endswith, so if you construct a tuple of your blacklisted domains you could do something like this:

>>> blacklisted = ('foo.com', 'bar.com')
>>> email = '[email protected]'
>>> email.endswith(blacklisted)
True

This even has the benefit of letting you blacklist some subdomains, but not others:

>>> blacklisted = ('foo.com', 'bar.com', 'sub.baz.net')
>>> email_bad = '[email protected]'
>>> email_bad.endswith(blacklisted)
True
>>> email_good = '[email protected]'
>>> email_good.endswith(blacklisted)
False

Edit: In response to Avaris's comment:

In order to make sure you don't end up with this situation:

>>> blacklisted = ('bar.com', 'baz.com')
>>> email = '[email protected]'
>>> email.endswith(blacklisted)
True

You can include both '.bar.com' and '@bar.com' in your blacklist for each domain. The result is:

>>> blacklisted = ('.bar.com', '@bar.com', '.baz.com', '@baz.com')
>>> email = '[email protected]'
>>> email.endswith(blacklisted)
False

This is obviously more work. At this point I would say this method versus regex is a matter of preference. While I try to avoid regex at all costs, it might be the way to go for you.
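
The extra work can be automated. A rough sketch, assuming you start from a list of bare domain names (build_suffixes, SUFFIXES and ends_with_blacklisted are hypothetical names, not anything from the standard library): each domain is expanded into its '@' and '.' forms once, and the resulting tuple is passed to endswith.

```python
def build_suffixes(domains):
    """Expand each bare domain into its '@domain' and '.domain' suffixes."""
    return tuple(prefix + d.lower() for d in domains for prefix in ("@", "."))


# Hypothetical blacklist of bare domains.
SUFFIXES = build_suffixes(["bar.com", "baz.com"])


def ends_with_blacklisted(email):
    """True if the address sits at a blacklisted domain or one of its subdomains."""
    return email.lower().endswith(SUFFIXES)
```

With this, 'user@bar.com' and 'user@mail.bar.com' match, while 'user@foobar.com' does not, because neither '@bar.com' nor '.bar.com' is a suffix of it.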

Upvotes: 6

Kent

Reputation: 195249

How about

.*foo\.com$

Does it work?
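
As written, that pattern would also match an address at 'notfoo.com', since nothing constrains the character before 'foo'. One possible tightening, sketched here with Python's re module (build_pattern, PATTERN and matches_blacklist are hypothetical names), requires an '@' or '.' immediately before the domain and anchors at the very end of the string with \Z:

```python
import re


def build_pattern(domains):
    """Compile one regex matching any listed domain, preceded by '@' or '.'."""
    alternatives = "|".join(re.escape(d) for d in domains)
    return re.compile(r"[@.](?:%s)\Z" % alternatives, re.IGNORECASE)


# Hypothetical blacklist of bare domains.
PATTERN = build_pattern(["foo.com", "bar.com"])


def matches_blacklist(email):
    """True if the address ends in a blacklisted domain or a subdomain of one."""
    return PATTERN.search(email) is not None
```

Using re.escape keeps the dots in each domain literal, so 'fooXcom' would not slip through.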

Upvotes: 0
