Reputation: 23
I have several HTML documents that contain the following type of information:
<td class="principal-col">
<div class="pr-person">
<div class="name"><span id="pr_person-icon" class="bullet-male-left"></span><span class="person-link">Thomas A /Dumpling/</span></div>
<table class="events" border="0">
<tr>
<td class="factLabel">event1: </td>
<td>
4 February 1940
<br/>
</td>
</tr>
<tr>
<td class="factLabel">event2: </td>
<td>
9 October 2002
<br/>Laplata, Md
</td>
</tr>
I'm trying to extract the person's name (here: Thomas A Dumpling), the event1 date (here: 4 February 1940), and the event2 date and place (here: 9 October 2002, Laplata, Md) from each HTML file, and write them to the columns "Name", "Event1", and "Event2" of the CSV file "data.csv".
So far I haven't been able to figure out how to extract the name at all. For the event1 and event2 information, the following Python code worked well on similar HTML files but not on the type of HTML posted above: it runs without errors but writes "missing" to the respective columns of the CSV file.
from bs4 import BeautifulSoup
import csv
import re
import glob
import os

f= csv.writer(open('data.csv', 'w'))
f.writerow(["Event1", "Event2"])
path = 'C:\\File-Path\\*'
for infile in glob.glob(os.path.join(path, "148929_S1N8-DQ7.htm")):
    soup = BeautifulSoup (open(infile))
    myDict = {}
    for item in ["event1:", "event2:"]:
        try:
            myDict[item] = soup.find('td', text=re.compile(r'^%s$' % item)).findNext('td').text
            myDict[item] = myDict[item].strip()
            myDict[item] = myDict[item].lstrip()
            myDict[item] = myDict[item].rstrip()
            myDict[item] = myDict[item].encode('UTF-8')
        except Exception:
            myDict[item] = "missing"
            pass
    f.writerow([myDict["event1:"], myDict["event2:"]])
Any pointers much appreciated!
Upvotes: 2
Views: 451
Reputation: 56634
First, I converted your sample data to a valid HTML page and pretty-printed it, which makes it easier to see what is going on:
<html><body><table><tr>
  <td class="principal-col">
    <div class="pr-person">
      <div class="name">
        <span id="pr_person-icon" class="bullet-male-left"></span>
        <span class="person-link">Thomas A /Dumpling/</span>
      </div>
      <table class="events" border="0">
        <tr>
          <td class="factLabel">event1: </td>
          <td>4 February 1940<br/></td>
        </tr>
        <tr>
          <td class="factLabel">event2: </td>
          <td>9 October 2002<br/>Laplata, Md</td>
        </tr>
      </table>
    </div>
  </td>
</tr></table></body></html>
Then I restructured your program a bit:
from bs4 import BeautifulSoup
import csv
import glob
import os

DATA_PATH = "c:\\file_path\\"
FILESPEC = "*.htm"
OUTFILE = "data.csv"

def main():
    data = []
    # Collect one (name, birth, hired) row per person cell in every matching file
    for fname in glob.glob(os.path.join(DATA_PATH, FILESPEC)):
        with open(fname) as inf:
            # Name the parser explicitly so results don't depend on what is installed
            pg = BeautifulSoup(inf.read(), "html.parser")
            for person in pg.findAll('td', {'class':'principal-col'}):
                data.append(get_data(person))
    data.sort()
    # Python 3: text mode with newline='' (on Python 2, open with 'wb' instead)
    with open(os.path.join(DATA_PATH, OUTFILE), 'w', newline='') as outf:
        outcsv = csv.writer(outf)
        outcsv.writerow(["Name", "Born", "Hired"])
        outcsv.writerows(data)

if __name__ == "__main__":
    main()
That leaves only the actual parsing code:
def get_string(node, default=''):
    # Join all non-empty text fragments under the node, or fall back to the default
    if node:
        return ', '.join(node.stripped_strings)
    else:
        return default

def get_data(td_princ):
    name = get_string(td_princ.find('span', {'class':'person-link'})).replace('/', '')
    birth = hired = '(missing)'
    for event in td_princ.find('table', {'class': 'events'}).findAll('tr'):
        cnt = [get_string(cell) for cell in event.findAll('td')]
        if len(cnt) == 2:
            if cnt[0] == "event1:":
                birth = cnt[1]
            elif cnt[0] == "event2:":
                hired = cnt[1]
    return (name, birth, hired)
When run against the sample data, this produces a CSV file that looks like:
Name,Born,Hired
Thomas A Dumpling,4 February 1940,"9 October 2002, Laplata, Md"
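As for why the original lookup wrote "missing": a likely cause is that the label cell's text is "event1: " with a trailing space, so a pattern anchored as ^event1:$ never matches, find returns None, and the resulting AttributeError is swallowed by the except clause. The sketch below checks the pattern with the re module alone; the tolerant pattern is my own suggestion, not something from the question.

```python
import re

# Text of the factLabel cell in the sample HTML, including its trailing space
label_text = "event1: "

# Pattern built the same way as in the question: anchored at both ends
strict = re.compile(r'^%s$' % "event1:")

# Hypothetical alternative that allows whitespace around the label
tolerant = re.compile(r'^\s*%s\s*$' % re.escape("event1:"))

strict_match = strict.search(label_text)
tolerant_match = tolerant.search(label_text)

print("strict matches:  ", strict_match is not None)
print("tolerant matches:", tolerant_match is not None)
```

Comparing stripped text, as get_string does above, sidesteps the problem without any regex.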
Upvotes: 1
Reputation: 25954
The easiest way to find the first Tag is just with a regular find (select works too):
soup.find(class_='person-link')
Out[4]: <span class="person-link">Thomas A /Dumpling/</span>
soup.select('.person-link')
Out[5]: [<span class="person-link">Thomas A /Dumpling/</span>]
Note the special-case use of class_ in find, because class is a reserved word in Python.
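To make the class_ point concrete, here is a minimal sketch on a cut-down copy of the name markup. It assumes the built-in "html.parser" backend, and the variable names are only illustrative. The keyword form and the attribute-dictionary form should locate the same tag, and the slashes around the surname can then be removed from its text.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="name">'
    '<span id="pr_person-icon" class="bullet-male-left"></span>'
    '<span class="person-link">Thomas A /Dumpling/</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ stands in for the reserved word class
by_keyword = soup.find(class_="person-link")

# Dictionary form: the attribute name is an ordinary string key
by_attrs = soup.find("span", {"class": "person-link"})

# Both lookups return the same node of the parse tree
same_tag = by_keyword is by_attrs

# Strip surrounding whitespace and drop the slashes around the surname
name = by_keyword.get_text(strip=True).replace("/", "")
print(same_tag, name)
```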
'event1' and 'event2' are easier to grab with select:
soup.select('td .factLabel ~ td')
Out[10]:
[<td>
4 February 1940
<br/>
</td>,
<td>
9 October 2002
<br/>Laplata, Md
</td>]
In the CSS selector above, you're asking for the td siblings that follow td class="factLabel" tags.
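The selected cells still carry newlines and the br tag, so a further step is needed before writing them to a CSV. The sketch below is one way to do that, under a few assumptions: it uses the "html.parser" backend, it writes the selector as td.factLabel ~ td (no space, which ties the class to the label cell itself and is my variation on the selector above), and the events dictionary is just an illustrative container.

```python
from bs4 import BeautifulSoup

html = """
<table class="events" border="0">
<tr><td class="factLabel">event1: </td><td>
4 February 1940
<br/>
</td></tr>
<tr><td class="factLabel">event2: </td><td>
9 October 2002
<br/>Laplata, Md
</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

events = {}
for value_cell in soup.select("td.factLabel ~ td"):
    # Walk back to the label cell that precedes this value cell
    label = value_cell.find_previous_sibling("td", class_="factLabel")
    key = label.get_text(strip=True).rstrip(":")
    # Join the non-empty text fragments on either side of the <br/>
    events[key] = ", ".join(value_cell.stripped_strings)

print(events)
```

Keying the values by their label text means a file that lacks one of the events simply lacks that key, rather than shifting values into the wrong column.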
If any of the above syntax is confusing, just head to the BeautifulSoup docs. They have a lot of good examples.
Upvotes: 1