Reputation: 2321
I have a list of emails and domains that I am trying to use as a blacklist.
For emails it is easy, since I can simply compare the addresses directly, but for domains, addresses on subdomains need to be matched as well.
So for the foo.com domain, I would need to filter out
[email protected]
[email protected]
How is this typically done? Via Regex? Splitting the email into the appropriate strings?
Upvotes: 3
Views: 4268
Reputation: 841
This would be the simplest way I can think of:
>>> f = '[email protected]'
>>> '.'.join(f.split('.')[-2:])
'bar.com'
It doesn't use a regex, it fits on one line, it's very readable, and it pulls out the domain name without caring whether the domain is a .com, .net, or whatever.
Then you'd just check the extracted domain against your blacklisted table.
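As a rough sketch of that check, assuming the blacklist fits in memory as a set: the names `BLACKLISTED_DOMAINS`, `registered_domain` and `is_blacklisted` are illustrative, not part of the answer above. It splits on '@' first so that an address with no subdomain is handled as well, and lowercases the domain because domain names are case-insensitive.

```python
# Hypothetical in-memory blacklist; a database table lookup would work the same way.
BLACKLISTED_DOMAINS = {'foo.com', 'bar.com'}

def registered_domain(email_addr):
    """Return the last two dot-separated labels of the address's domain."""
    # Everything after the last '@' is the domain part.
    domain = email_addr.rsplit('@', 1)[-1].lower()
    return '.'.join(domain.split('.')[-2:])

def is_blacklisted(email_addr):
    """True if the two-label domain of email_addr is on the blacklist."""
    return registered_domain(email_addr) in BLACKLISTED_DOMAINS
```

Like the one-liner, this assumes a two-label domain, so it does not handle suffixes such as .co.uk.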
EDIT: Ok, for .co.uk domains et al
>>> import re
>>> def get_addr(email_addr):
...     parts = re.split(r'[\@\.]', email_addr)
...     return '.'.join(parts[(-3 if parts[-2] == 'co' else -2):])
>>> get_addr('[email protected]')
'bar.com'
>>> get_addr('[email protected]')
'bar.co.uk'
>>> get_addr('[email protected]')
'bar.com'
EDIT: @Wilduck pointed out that there may be use cases where you want to filter out specific subdomains but not others (e.g. 'community.ebay.co.uk'). I figured you may want to blacklist specific email addresses too, without needing a separate table (e.g. [email protected]). Here's my solution:
>>> BLACKLIST = {'foo.com'}  # placeholder data; substitute your own list or db table
>>> def is_in_blacklist(addr):
...     # check if addr is in your list or db table
...     return addr in BLACKLIST
>>> def addr_is_blacklisted(addr):
...     if not addr:
...         return False
...     if is_in_blacklist(addr):
...         return True
...     # drop the local part first, then one leading label per call
...     sliced = '.'.join(addr.split('@' if '@' in addr else '.')[1:])
...     return addr_is_blacklisted(sliced)
So it deconstructs the email address from beginning to end and checks each part against your blacklist. Obviously you can't get an answer with a single query, but you can filter by single email addresses, subdomains, domains, and all the way down to top-level domains if you're so inclined. You'll have 3-4 queries per email on average, and you won't kill yourself if you have a huge blacklist.
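The same walk can be written without recursion. This is only a sketch under the assumption that the blacklist is any container supporting `in` (a set here); `candidate_keys` is a made-up helper name, and the lowercasing is an addition not present in the code above.

```python
def candidate_keys(addr):
    """Yield addr, then its domain, then each parent domain down to the TLD."""
    addr = addr.lower()
    yield addr
    if '@' in addr:
        # Drop the local part; the domain follows the last '@'.
        addr = addr.rsplit('@', 1)[1]
        yield addr
    while '.' in addr:
        # Strip one leading label per step.
        addr = addr.split('.', 1)[1]
        yield addr

def addr_is_blacklisted(addr, blacklist):
    """True if the address or any enclosing domain is in blacklist."""
    return any(key in blacklist for key in candidate_keys(addr))
```

Each yielded key corresponds to one lookup, which is where the 3-4 queries per email come from.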
Upvotes: 1
Reputation: 14136
I'm thinking the easiest way to go about this is to use the string method endswith. This method works as follows:
>>> blacklisted = 'foo.com'
>>> email = '[email protected]'
>>> email.endswith('foo.com')
True
>>> email = '[email protected]'
>>> email.endswith('foo.com')
True
So this returns True if the domain, or email, or whatever ends with 'foo.com'. As you can see, this includes all subdomains of 'foo.com'. Conveniently, you can also pass a tuple to endswith, so if you construct a tuple of your blacklisted domains you could do something like this:
>>> blacklisted = ('foo.com', 'bar.com')
>>> email = '[email protected]'
>>> email.endswith(blacklisted)
True
This also has the benefit of letting you blacklist some subdomains but not others:
>>> blacklisted = ('foo.com', 'bar.com', 'sub.baz.net')
>>> email_bad = '[email protected]'
>>> email_bad.endswith(blacklisted)
True
>>> email_good = '[email protected]'
>>> email_good.endswith(blacklisted)
False
Edit: In response to Avaris's comment:
In order to make sure you don't end up with this situation:
>>> blacklisted = ('bar.com', 'baz.com')
>>> email = '[email protected]'
>>> email.endswith(blacklisted)
True
You can include both '.bar.com' and '@bar.com' in your blacklist. The result is:
>>> blacklisted = ('.bar.com', '@bar.com', '.baz.com', '@baz.com')
>>> email = '[email protected]'
>>> email.endswith(blacklisted)
False
This is obviously more work. At this point I would say this method versus regex is a matter of preference. While I try to avoid regex at all costs, it might be the way to go for you.
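Some of that extra work can be automated. A minimal sketch, assuming the blacklist starts out as a plain list of domains: `build_suffixes` and `is_blacklisted` are hypothetical helper names, and the lowercasing is an extra assumption on top of the approach above.

```python
def build_suffixes(domains):
    """Expand each domain into '@domain' and '.domain' for str.endswith."""
    suffixes = []
    for domain in domains:
        domain = domain.lower()
        suffixes.append('@' + domain)  # matches the domain itself
        suffixes.append('.' + domain)  # matches any subdomain
    return tuple(suffixes)

# Example data only; substitute your own blacklist.
BLACKLISTED = build_suffixes(['bar.com', 'baz.com', 'sub.baz.net'])

def is_blacklisted(email):
    """True if email ends with one of the prefixed blacklist suffixes."""
    return email.lower().endswith(BLACKLISTED)
```

With the prefixes in place, an address at foobar.com no longer matches 'bar.com', while addresses at bar.com and its subdomains still do.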
Upvotes: 6