Reputation: 67
I'm using BeautifulSoup to extract some data from a search result on this website: http://www.cpso.on.ca/docsearch/default.aspx
Here's a sample of the HTML after running it through .prettify():
<tr>
<td>
<a class="doctor" href="details.aspx?view=1&id= 72374">
Smith, Jane
</a>
(#72374)
</td>
<td>
Suite 042
<br />
21 Jump St
<br />
Toronto ON M4C 5T2
<br />
Phone: (555) 555-5555
<br />
Fax: (555) 555-555
</td>
<td align="center">
</td>
</tr>
Essentially, every <tr> block has 3 <td> blocks.
I want the output to be
Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2
I also have to separate entries by a new line.
I'm having a problem writing the address, which is stored in the 2nd <td> block.
I'm also writing this to a file.
Here's what I have so far... it doesn't work :p
    for tr in soup.findAll('tr'):
        #td1 = tr.td
        td2 = tr.td.nextSibling.nextSibling
        for a in tr.findAll('a'):
            target.write(a.string)
        target.write(" ")
        for i in range(len(td2.contents)):
            if i != None:
                target.write(td2.contents[i].string)
                target.write(" ")
        target.write("\n")
Upvotes: 3
Views: 280
Reputation: 120568
This should do most of what you want:
    import os
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

    with open('output.txt', 'wb') as stream:
        for tr in soup.findAll('tr')[1:]: # [1:] skips the header
            columns = tr.findAll('td')
            line = [columns[0].a.string.strip()]
            for item in (item.strip() for item in columns[1].findAll(text=True)):
                if (item and not item.startswith('Phone:')
                    and not item.startswith('Fax:')):
                    line.append(item)
            stream.write(' '.join(line).encode('utf-8'))
            stream.write(os.linesep)
UPDATE
Added some code to show how to write the names and addresses to a file.
Also changed the output so that names and addresses are written on one line, and phone and fax numbers are omitted.
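The filtering step above can be looked at in isolation. What follows is only a minimal Python 3 sketch: `build_line` is a hypothetical helper name, and the `fragments` list is an assumed stand-in for the strings that `findAll(text=True)` would return for the second cell of the sample row.

```python
def build_line(name, fragments, skip_prefixes=("Phone:", "Fax:")):
    """Join a name and its address fragments on one line, dropping contact numbers."""
    parts = [name.strip()]
    for fragment in fragments:
        text = fragment.strip()
        # Skip whitespace-only nodes and the phone/fax entries.
        if text and not text.startswith(skip_prefixes):
            parts.append(text)
    return " ".join(parts)


# Assumed input: text nodes as they might look after .prettify().
fragments = [
    "\n Suite 042\n ",
    "\n 21 Jump St\n ",
    "\n Toronto ON M4C 5T2\n ",
    "\n Phone: (555) 555-5555\n ",
    "\n Fax: (555) 555-555\n ",
]
print(build_line("\n Smith, Jane\n ", fragments))
```

Filtering by prefix does not depend on where the phone and fax entries appear in the cell, so it should keep working for rows that have only one of them or neither.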
Upvotes: 1
Reputation: 36767
I'd try something like this:
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(your_html,
                         convertEntities=BeautifulSoup.HTML_ENTITIES)

    for tr in soup.findAll('tr'):
        td = tr.findAll('td')
        target.write(td[0].a.string)
        target.write(' ')
        # find all text subnodes except the last 2 (phone and fax numbers)
        # and join them with a ' ' separator
        target.write(' '.join(text.strip() for text in td[1].findAll(text=True)[:-2]))
        target.write("\n")
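For readers without BeautifulSoup at hand, the same slicing idea can be sketched with the standard library's `html.parser` instead (a swapped-in parser, not the code above). `RowTextParser` is a hypothetical class name, the HTML is a compacted copy of the sample row from the question, and the sketch assumes every row has a name link in its first cell and ends its second cell with exactly a phone and a fax entry.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the stripped text fragments of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append(self._cell)
            self._cell = None

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that sits inside a <td>.
        if text and self._cell is not None:
            self._cell.append(text)


html = """<tr>
<td><a class="doctor" href="details.aspx?view=1&id= 72374">Smith, Jane</a> (#72374)</td>
<td>Suite 042<br />21 Jump St<br />Toronto ON M4C 5T2<br />Phone: (555) 555-5555<br />Fax: (555) 555-555</td>
<td align="center"></td>
</tr>"""

parser = RowTextParser()
parser.feed(html)

# cells[0][0] is the link text; cells[1][:-2] drops the trailing phone and fax.
lines = [" ".join([cells[0][0]] + cells[1][:-2]) for cells in parser.rows]
print("\n".join(lines))
```

The `[:-2]` slice is positional, so a row that lacks a fax number would lose part of its address; a header row without `<td>` cells would also need to be skipped before indexing.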
Upvotes: 1
Reputation: 10897
In [243]: soup.getText(' ').replace('  ', ' ').strip()
Out[243]: u'Smith, Jane (#72374) Suite 042 21 Jump St Toronto ON M4C 5T2 Phone: (555) 555-5555 Fax: (555) 555-555'
To get exactly what you want:
In [246]: address = soup.getText(' ').replace('  ', ' ').strip()
In [247]: import re
In [248]: address = re.sub(r' Phone.*$', '', address)
In [249]: address = address.replace('  ', ' ')
In [250]: address = re.sub(r' \(.*?\)', '', address)
In [251]: print address
Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2
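The same clean-up can be wrapped in a function. This is a rough Python 3 sketch using only `re`: `strip_contact_info` is a hypothetical name, `raw` is an assumed approximation of the text extracted from the row, and `\s+` is used to collapse whitespace in place of the chained `replace` calls.

```python
import re


def strip_contact_info(text):
    """Reduce the flattened row text to 'name address' on one line."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    text = re.sub(r" Phone.*$", "", text)     # drop the phone number and everything after it
    text = re.sub(r" \(.*?\)", "", text)      # drop the "(#72374)" identifier
    return text


# Assumed input: the row's text nodes joined with newlines.
raw = (
    "Smith, Jane\n(#72374)\nSuite 042\n21 Jump St\n"
    "Toronto ON M4C 5T2\nPhone: (555) 555-5555\nFax: (555) 555-555"
)
print(strip_contact_info(raw))
```

This relies on "Phone" appearing before "Fax" and on the address itself containing no parentheses, so it is more fragile than working on the individual `<td>` cells.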
Upvotes: 1