user3289992

Reputation: 23

Parsing HTML with Python and BeautifulSoup

I have several html documents that include the following type of information:

<td class="principal-col">
<div class="pr-person">
<div class="name"><span id="pr_person-icon" class="bullet-male-left"></span><span class="person-link">Thomas A /Dumpling/</span></div>
<table class="events" border="0">
<tr>
<td class="factLabel">event1:&nbsp;</td>
<td>
4 February 1940          
<br/>
</td>
</tr> 
<tr>
<td class="factLabel">event2:&nbsp;</td>
<td>
9 October 2002   
<br/>Laplata, Md
</td>
</tr>

I'm trying to extract the name of the person (here: Thomas A Dumpling), as well as event1 (here: 4 February 1940) and event2 date and place (here: 9 October 2002, Laplata, Md), from an html file, and put the content in columns named "Name", "Event1", "Event2" of the csv file "data.csv".

So far I haven't been able to figure out how to extract the name from the HTML above at all. For the event1 and event2 information, the following code worked well on similar HTML files but not on the type of HTML posted above: the script runs without errors, but it writes "missing" to the respective columns of the CSV file.

from bs4 import BeautifulSoup
import csv
import re
import glob
import os

f= csv.writer(open('data.csv', 'w'))
f.writerow(["Event1", "Event2"]) 

path = 'C:\\File-Path\\*'

for infile in glob.glob(os.path.join(path, "148929_S1N8-DQ7.htm")):
    soup = BeautifulSoup (open(infile))
    myDict = {}
    for item in ["event1:", "event2:"]:
        try:
            myDict[item] = soup.find('td', text=re.compile(r'^%s$' % item)).findNext('td').text
            myDict[item] = myDict[item].strip()
            myDict[item] = myDict[item].lstrip()
            myDict[item] = myDict[item].rstrip()
            myDict[item] = myDict[item].encode('UTF-8')
        except Exception:
            myDict[item] = "missing"
            pass
    f.writerow([myDict["event1:"], myDict["event2:"]])

Any pointers much appreciated!

Upvotes: 2

Views: 451

Answers (2)

Hugh Bothwell

Reputation: 56634

First, I converted your sample data to a valid HTML page and pretty-printed it, which makes it easier to see what is going on:

<html><body><table><tr>
<td class="principal-col">
  <div class="pr-person">
    <div class="name">
      <span id="pr_person-icon" class="bullet-male-left"></span>
      <span class="person-link">Thomas A /Dumpling/</span>
    </div>
    <table class="events" border="0">
      <tr>
        <td class="factLabel">event1:&nbsp;</td>
        <td>4 February 1940<br/></td>
      </tr> 
      <tr>
        <td class="factLabel">event2:&nbsp;</td>
        <td>9 October 2002<br/>Laplata, Md</td>
      </tr>
    </table>
  </div>
</td>
</tr></table></body></html>

then switched your program around a bit:

from bs4 import BeautifulSoup
import csv
import glob
import os

DATA_PATH = "c:\\file_path\\"
FILESPEC  = "*.htm"
OUTFILE   = "data.csv"

def main():
    data = []
    for fname in glob.glob(os.path.join(DATA_PATH, FILESPEC)):
        with open(fname) as inf:
            # name the parser explicitly so results don't depend on what is installed
            pg = BeautifulSoup(inf, 'html.parser')
            for person in pg.findAll('td', {'class':'principal-col'}):
                data.append(get_data(person))
    data.sort()

    # Python 3: open csv files in text mode with newline='' (Python 2 needs 'wb' instead)
    with open(os.path.join(DATA_PATH, OUTFILE), 'w', newline='') as outf:
        outcsv = csv.writer(outf)
        outcsv.writerow(["Name", "Born", "Hired"])
        outcsv.writerows(data)

if __name__ == "__main__":
    main()

which only leaves the actual parsing code,

def get_string(node, default=''):
    if node:
        return ', '.join(node.stripped_strings)
    else:
        return default

def get_data(td_princ):
    name = get_string(td_princ.find('span', {'class':'person-link'})).replace('/', '')

    birth = hired = '(missing)'
    for event in td_princ.find('table', {'class': 'events'}).findAll('tr'):
        cnt = [get_string(cell) for cell in event.findAll('td')]
        if len(cnt) == 2:
            if cnt[0] == "event1:":
                birth = cnt[1]
            elif cnt[0] == "event2:":
                hired = cnt[1]
    return (name, birth, hired)

which, when run against the sample data, results in a csv file that looks like

Name,Born,Hired
Thomas A Dumpling,4 February 1940,"9 October 2002, Laplata, Md"
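As for why the original script wrote "missing": a likely cause (an assumption, not something I verified against your real files) is the &nbsp; at the end of each label cell. Once decoded, the cell text is "event1:" followed by a non-breaking space, so a pattern anchored as ^event1:$ cannot match. The sketch below shows this with the standard library only; the variable names are purely illustrative.

```python
import html
import re

# Text of the label cell once the &nbsp; entity is decoded (ends in U+00A0).
label_text = html.unescape("event1:&nbsp;")

# The pattern from the question is anchored at both ends, so the trailing
# non-breaking space prevents a match.
strict_match = re.search(r'^%s$' % "event1:", label_text)

# Allowing trailing whitespace matches; \s covers U+00A0 for str patterns.
loose_match = re.search(r'^%s\s*$' % re.escape("event1:"), label_text)

# Stripping before comparing also works, since str.strip() removes U+00A0.
stripped = label_text.strip()

print(strict_match is None, loose_match is not None, stripped == "event1:")
```

This is also why comparing against the stripped cell text, as get_data does above, finds the labels.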

Upvotes: 1

roippi

Reputation: 25954

The easiest way to find the first Tag is just with regular find (select works too):

soup.find(class_='person-link')
Out[4]: <span class="person-link">Thomas A /Dumpling/</span>

soup.select('.person-link')
Out[5]: [<span class="person-link">Thomas A /Dumpling/</span>]

Note the use of class_ rather than class in find: class is a reserved word in Python.
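To turn that tag into the plain name for the CSV, something along these lines should work. It is only a sketch: the fragment is copied from the question, the html.parser choice and the "missing" fallback are my assumptions, and dropping the slashes with replace is just one way to clean up the surname.

```python
from bs4 import BeautifulSoup

# Fragment copied from the question. html.parser is used here only so the
# sketch does not depend on an optional parser being installed.
SAMPLE = (
    '<div class="name">'
    '<span id="pr_person-icon" class="bullet-male-left"></span>'
    '<span class="person-link">Thomas A /Dumpling/</span>'
    '</div>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
tag = soup.find(class_="person-link")

# find returns None when nothing matches, so guard before reading the text.
# The slashes around the surname are removed to leave the bare name.
name = tag.get_text(strip=True).replace("/", "") if tag is not None else "missing"

print(name)
```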

'event1' and 'event2' are easier to grab with select:

soup.select('td.factLabel ~ td')
Out[10]: 
[<td>
4 February 1940          
<br/>
</td>,
 <td>
9 October 2002   
<br/>Laplata, Md
</td>]

In the CSS selector above, you're asking for the td siblings that follow each td class="factLabel" tag.
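The selected cells still contain line breaks and padding, so you will probably want to clean them up before writing the CSV. A minimal sketch, assuming the table fragment from the question and html.parser (both my choices for illustration), with the sibling selector written as td.factLabel ~ td:

```python
from bs4 import BeautifulSoup

# Table fragment copied from the question, closed off so it stands alone.
SAMPLE = """
<table class="events" border="0">
<tr>
<td class="factLabel">event1:&nbsp;</td>
<td>
4 February 1940
<br/>
</td>
</tr>
<tr>
<td class="factLabel">event2:&nbsp;</td>
<td>
9 October 2002
<br/>Laplata, Md
</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Each match is the value cell that follows a td with class "factLabel".
cells = soup.select("td.factLabel ~ td")

# stripped_strings yields each text fragment with surrounding whitespace
# removed and skips the empty ones, so date and place join cleanly.
values = [", ".join(cell.stripped_strings) for cell in cells]

print(values)
```

The resulting list is in document order, so values[0] would go in the Event1 column and values[1] in Event2.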

If any of the above syntax is confusing, just head to the BeautifulSoup docs. They have a lot of good examples.

Upvotes: 1
