Reputation: 47
I'm trying to parse an HTML email in Python to extract various details, and I would appreciate a regular expression or two to help, as this is too complex for my limited regex understanding. For example, look for 'Travel Date' and extract 'October 30 2018 (Tue)'.
In all cases there is a field name contained within <td> tags, followed by the field value contained within another set of <td> tags. Sometimes the name and value are contained within the same row <tr> tags (Case 1) and other times they are in separate row tags (Case 2). Other items like <span> and <img> need to be skipped over as well.
Case 1
<tr>
<td colspan="2"> </td></tr>
<tr><td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #777777;">Travel Date</td>
<td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #444444;">October 30 2018 (Tue)</td>
</tr>
Case 2
<tr><td style="vertical-align: top;">
<span style="font-size: 10px; font-family: Arial; color: #999999; font-weight: bold; line-height: 19px; text-transform: uppercase;">Drop-off to Address</span>
</td></tr>
<tr><td style="vertical-align: top;">
<span style="font-size: 13px; font-family: Arial; color: #444444;"><img style="vertical-align:text-bottom;" src="https://d1lk4k9zl9klra.cloudfront.net/Email/Common/address_icon.png" alt="" width="14" height="14" /> 200 George St, Sydney NSW 2000, Australia</span>
</td></tr>
Upvotes: 0
Views: 78
Reputation: 602
Instead of using regex, I would use Beautiful Soup. It makes it easier to walk through HTML elements and scrape what you need. If you know how a field name and its value are positioned relative to each other, you can use that relationship to extract the value. Here's an example for Case 1:
In [8]: from bs4 import BeautifulSoup
In [9]: text = """
   ...: <tr>
   ...: <td colspan="2"> </td></tr>
   ...: <tr><td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #777777;">Travel Date</td>
   ...: <td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #444444;">October 30 2018 (Tue)</td>
   ...: </tr>"""
In [11]: soup = BeautifulSoup(text, 'lxml')
In [13]: soup.find_all('td')
Out[13]:
[<td colspan="2"> </td>,
 <td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #777777;">Travel Date</td>,
 <td style="vertical-align: top; font-size: 13px; font-family: Arial; color: #444444;">October 30 2018 (Tue)</td>]
In [15]: for tag in soup.find_all('td'):
    ...:     if tag.text == "Travel Date":
    ...:         # find_next() returns the next tag in document order, here the value cell
    ...:         print(tag.find_next().text)
    ...:
October 30 2018 (Tue)
Beautiful Soup gives a lot of flexibility when scraping HTML from the web.
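The same idea can probably be stretched to cover Case 2 as well. Below is a rough sketch, not a definitive solution: it flattens every <td> to its text, drops the empty cells, and returns whatever non-empty cell comes right after the one matching the field name. The names extract_field and SAMPLE_HTML are made up for this example, the style attributes are trimmed, and it uses the built-in 'html.parser' backend so lxml is not required. It assumes the cells are not nested inside other <td> cells; with nested tables an outer cell's text would include its inner cells and the matching would need tightening.

```python
from bs4 import BeautifulSoup

# Trimmed copies of the two fragments from the question (style attributes shortened).
SAMPLE_HTML = """
<tr>
<td colspan="2"> </td></tr>
<tr><td style="color: #777777;">Travel Date</td>
<td style="color: #444444;">October 30 2018 (Tue)</td>
</tr>
<tr><td style="vertical-align: top;">
<span style="text-transform: uppercase;">Drop-off to Address</span>
</td></tr>
<tr><td style="vertical-align: top;">
<span style="color: #444444;"><img src="https://d1lk4k9zl9klra.cloudfront.net/Email/Common/address_icon.png" alt="" width="14" height="14" /> 200 George St, Sydney NSW 2000, Australia</span>
</td></tr>
"""


def extract_field(html, label):
    """Return the text of the first non-empty <td> after the <td> whose text equals label.

    Hypothetical helper: returns None when the label is not found or has no following cell.
    """
    soup = BeautifulSoup(html, "html.parser")
    # get_text(strip=True) flattens nested tags such as <span> and <img>
    # and trims surrounding whitespace, so spacer cells become empty strings.
    cells = [td.get_text(" ", strip=True) for td in soup.find_all("td")]
    cells = [text for text in cells if text]
    # Pair each cell with its successor in document order; this covers both
    # "same row" (Case 1) and "next row" (Case 2) layouts.
    for name, value in zip(cells, cells[1:]):
        if name == label:
            return value
    return None


if __name__ == "__main__":
    print(extract_field(SAMPLE_HTML, "Travel Date"))
    print(extract_field(SAMPLE_HTML, "Drop-off to Address"))
```

Note that the comparison is against the text as written in the HTML ("Drop-off to Address"), since the CSS text-transform: uppercase only affects how a mail client renders it, not what Beautiful Soup sees.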
Upvotes: 1