iman453

Reputation: 9535

Extract domain from body of email

I was wondering if there is any way I could extract domain names from the body of email messages in Python. I was thinking of using regular expressions, but I am not too great at writing them, and was wondering if someone could help me out. Here's a sample email body:

<tr><td colspan="5"><font face="verdana" size="4" color="#999999"><b>Resource Links - </b></font><span class="snv"><a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a></span></td><td class="snv" valign="bottom" align="right"><a href="http://sprinks.about.com/faq/index.htm">What Is This?</a></td></tr><tr><td colspan="6" bgcolor="#999999"><img height="1" width="1"></td></tr><tr><td colspan="6"><map name="sgmap"><area href="http://x.about.com/sg/r/3412.htm?p=0&amp;ref=fooddrinksl_sg" shape="rect" coords="0, 0, 600, 20"><area href="http://x.about.com/sg/r/3412.htm?p=1&amp;ref=fooddrinksl_sg" shape="rect" coords="0, 55, 600, 75"><area href="http://x.about.com/sg/r/3412.htm?p=2&amp;ref=fooddrinksl_sg" shape="rect" coords="0, 110, 600, 130"></map><img border="0" src="http://z.about.com/sg/sg.gif?cuni=3412" usemap="#sgmap" width="600" height="160"></td></tr><tr><td colspan="6">&nbsp;</td></tr>
<tr><td colspan="6"><a name="d"><font face="verdana" size="4" color="#cc0000"><b>Top Picks - </b></font></a><a href="http://slclk.about.com/?zi=1/BAO" class="srvb">Fun Gift Ideas</a><span class="snv">
 from your <a href="http://chinesefood.about.com">Chinese Cuisine</a> Guide</span></td></tr><tr><td colspan="6" bgcolor="cc0000"><img height="1" width="1"></td></tr><tr><td colspan="6" class="snv">

So I would need "clk.about.com" etc.

Thanks!

Upvotes: 1

Views: 911

Answers (5)

vonPetrushev

Reputation: 5599

The cleanest way to do it is with cssselect from lxml.html and urlparse. Here is how:

from lxml import html
from urlparse import urlparse

doc = html.fromstring(html_data)
links = doc.cssselect("a")
domains = set()
for link in links:
    try:
        href = link.attrib['href']
    except KeyError:
        continue
    parsed = urlparse(href)
    domains.add(parsed.netloc)
print domains

First you load the HTML data into a document object with fromstring. You query the document for links using standard CSS selectors with cssselect. You traverse the links, grab their URLs with .attrib['href'], and skip them if they don't have one (the except/continue). Parse each URL into a named tuple with urlparse and put the domain (netloc) into a set. Voila!
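
For reference, here is roughly what urlparse gives you for one of the links in the sample body, and where netloc sits in the result (a quick sketch; the URL is taken from the question):

from urlparse import urlparse

parsed = urlparse('http://clk.about.com/?zi=4/RZ')
# parsed is a 6-part named tuple: (scheme, netloc, path, params, query, fragment)
print parsed.netloc   # clk.about.com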

Try to avoid regular expressions when good libraries are available: they are hard to maintain, and they are a no-go for parsing HTML.

UPDATE: The href filter suggested in the comments is very helpful; with it, the code looks like this:

from lxml import html
from urlparse import urlparse

doc = html.fromstring(html_data)
links = doc.cssselect("a[href]")
domains = set()
for link in links:
    href = link.attrib['href']
    parsed = urlparse(href)
    domains.add(parsed.netloc)
print domains

You don't need the try/except block, since the href filter makes sure you only get the anchors that have an href attribute.
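
If you want to see the difference the selector makes, here is a small sketch with a made-up fragment containing one anchor without an href:

from lxml import html

doc = html.fromstring('<div><a name="d">no href</a><a href="http://chinesefood.about.com">link</a></div>')
print len(doc.cssselect("a"))        # 2 -- both anchors
print len(doc.cssselect("a[href]"))  # 1 -- only the anchor that has an href attribute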

Upvotes: 2

Bernhard

Reputation: 8831

Given that you always have an http:// protocol specifier in front of the domains, this should work (txt is your example text).

import re
[groups[0] for groups in re.findall(r'http://(\w+(\.\w+){1,})(/\w+)*', txt)]

The pattern for domains is not perfect, though.
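
For example, running it over a couple of the anchors from the sample gives the domains (a quick check; re.findall returns one tuple of groups per match, and groups[0] is the domain capture):

import re

txt = '<a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a> <a href="http://sprinks.about.com/faq/index.htm">What Is This?</a>'
print [groups[0] for groups in re.findall(r'http://(\w+(\.\w+){1,})(/\w+)*', txt)]
# ['clk.about.com', 'sprinks.about.com']

Wrapping the result in set() would drop duplicate domains, as the lxml answers do.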

Upvotes: 1

Uku Loskit

Reputation: 42040

from lxml import etree
from StringIO import StringIO
from urlparse import urlparse
html = """<tr><td colspan="5"><font face="verdana" size="4" color="#999999"><b>Resource Links - </b></font><span class="snv"><a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a></span></td><td class="snv" valign="bottom" align="right"><a href="http://sprinks.about.com/faq/index.htm">What Is This?</a></td></tr><tr><td colspan="6" bgcolor="#999999"><img height="1" width="1"></td></tr><tr><td colspan="6"><map name="sgmap"><area href="http://x.about.com/sg/r/3412.htm?p=0&amp;ref=fooddrinksl_sg" shape="rect" coords="0, 0, 600, 20"><area href="http://x.about.com/sg/r/3412.htm?p=1&amp;ref=fooddrinksl_sg" shape="rect" coords="0, 55, 600, 75"><area href="http://x.about.com/sg/r/3412.htm?p=2&amp;ref=fooddrinksl_sg" shape="rect" coords="0, 110, 600, 130"></map><img border="0" src="http://z.about.com/sg/sg.gif?cuni=3412" usemap="#sgmap" width="600" height="160"></td></tr><tr><td colspan="6">&nbsp;</td></tr><tr><td colspan="6"><a name="d"><font face="verdana" size="4" color="#cc0000"><b>Top Picks - </b></font></a><a href="http://slclk.about.com/?zi=1/BAO" class="srvb">Fun Gift Ideas</a><span class="snv"> from your <a href="http://chinesefood.about.com">Chinese Cuisine</a> Guide</span></td></tr><tr><td colspan="6" bgcolor="cc0000"><img height="1" width="1"></td></tr><tr><td colspan="6" class="snv">"""
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
r = tree.xpath("//a")
links = []
for i in r:
    try:
        links.append(i.attrib['href'])
    except KeyError:
        pass

for link in links:
    print urlparse(link)    

From here on, the domain is available as netloc on each parse result. The XPath is probably not the best here (someone please suggest an improvement), but it should suit your needs.
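
If only the domains are wanted, the parse results can be boiled down to their netloc fields (a small sketch building on the links list from the code above):

domains = set(urlparse(link).netloc for link in links)
print domains   # the unique domains of the anchors, e.g. clk.about.com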

Upvotes: 1

Vamana

Reputation: 578

HTMLParser is the clean way to do it. If you want something quick and dirty, or just want to see what a moderately complex regex looks like, here's an example regex to find hrefs (off the top of my head, not tested):

r'<a\s+href="\w+://[^/"]+[^"]*">'
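
As written, the pattern only matches anchors whose tag closes right after the href attribute, which fits the "quick and dirty" caveat. Here is a small sketch of pulling the domain out directly, with the host part wrapped in a capturing group (a variation on the pattern above, not the original):

import re

sample = '<a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a>'
# the host is captured, so findall returns it directly
print re.findall(r'<a\s+href="\w+://([^/"]+)[^"]*">', sample)
# ['clk.about.com']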

Upvotes: 1

Abbafei

Reputation: 3126

You can use HTMLParser from the Python standard library to get to certain parts of the document.
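
A minimal sketch of that approach, sticking with the Python 2 modules used elsewhere in this thread (the class name and the short sample string are just for illustration):

from HTMLParser import HTMLParser
from urlparse import urlparse

class DomainCollector(HTMLParser):
    # collects the domain of every href attribute seen in a start tag
    def __init__(self):
        HTMLParser.__init__(self)
        self.domains = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        for name, value in attrs:
            if name == 'href':
                self.domains.add(urlparse(value).netloc)

collector = DomainCollector()
collector.feed('<a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a>')
print collector.domains   # set(['clk.about.com'])

Fed the whole email body, this would also pick up the href attributes on the <area> tags, which the <a>-only approaches above skip.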

Upvotes: 1
