Reputation: 67

Parsing HTML with Python

I'm using BeautifulSoup to extract data from a search results page on this website: http://www.cpso.on.ca/docsearch/default.aspx

Here's a sample of the HTML after running it through .prettify():

<tr>
 <td>
  <a class="doctor" href="details.aspx?view=1&amp;id= 72374">
   Smith, Jane
  </a>
  (#72374)
 </td>
 <td>
  Suite 042
  <br />
  21 Jump St
  <br />
  Toronto&nbsp;ON&nbsp;&nbsp;M4C 5T2
  <br />
  Phone:&nbsp;(555) 555-5555
  <br />
  Fax:&nbsp;(555) 555-555
 </td>
 <td align="center">
 </td>
</tr>

Essentially every <tr> block has 3 <td> blocks.

I want the output to be

Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2

I also have to separate entries by a new line.

I'm having trouble writing out the address, which is stored in the 2nd <td> block.

I'm also writing this to a file.

Here's what I have so far... it doesn't work :p

for tr in soup.findAll('tr'):
    #td1 = tr.td
    td2 = tr.td.nextSibling.nextSibling 

    for a in tr.findAll('a'):
        target.write(a.string)
        target.write(" ")

    for i in range(len(td2.contents)):
        if i != None:
            target.write(td2.contents[i].string)
            target.write(" ")
    target.write("\n")

Upvotes: 3

Answers (3)

ekhumoro

Reputation: 120768

This should do most of what you want:

import os
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

with open('output.txt', 'wb') as stream:
    for tr in soup.findAll('tr')[1:]: # [1:] skips the header
        columns = tr.findAll('td')
        line = [columns[0].a.string.strip()]
        for item in (item.strip() for item in columns[1].findAll(text=True)):
            if (item and not item.startswith('Phone:')
                and not item.startswith('Fax:')):
                line.append(item)
        stream.write(' '.join(line).encode('utf-8'))
        stream.write(os.linesep)

UPDATE

Added some code to show how to write the names and addresses to file.

Also changed the output so that names and addresses are written on one line, and phone and fax numbers are omitted.
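One thing to watch for: with entity conversion turned on, each &nbsp; should come through as a non-breaking space (u'\xa0') rather than a plain space, so the city line can still contain doubled or non-ASCII spaces. Below is a minimal sketch of the filtering step on its own, with whitespace collapsed as well. It assumes Python 3, and `build_line` and the sample `parts` list are hypothetical names made up for illustration; they are not part of the code above.

```python
def build_line(name, address_parts):
    """Join a name and address fragments into one line.

    Hypothetical helper: skips empty fragments and any fragment that
    starts with 'Phone:' or 'Fax:', and collapses runs of whitespace
    (including non-breaking spaces) into single spaces.
    """
    line = [' '.join(name.split())]
    for part in address_parts:
        # str.split() with no arguments splits on any Unicode whitespace,
        # which in Python 3 includes '\xa0' (the non-breaking space).
        item = ' '.join(part.split())
        if item and not item.startswith(('Phone:', 'Fax:')):
            line.append(item)
    return ' '.join(line)


# Sample text nodes roughly as they might come out of the second <td>
# (assumed shape, based on the HTML in the question).
parts = [
    '\n  Suite 042\n  ',
    '\n  21 Jump St\n  ',
    '\n  Toronto\xa0ON\xa0\xa0M4C 5T2\n  ',
    '\n  Phone:\xa0(555) 555-5555\n  ',
    '\n  Fax:\xa0(555) 555-555\n ',
]

print(build_line('\n   Smith, Jane\n  ', parts))
```

Filtering on the 'Phone:' and 'Fax:' prefixes rather than on position means a row with a missing fax number should still come out right.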

Upvotes: 1

soulcheck

Reputation: 36777

I'd try something like this:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(your_html, 
               convertEntities=BeautifulSoup.HTML_ENTITIES)

for tr in soup.findAll('tr'):
   td = tr.findAll('td')

   target.write(td[0].a.string)
   target.write(' ')

   # find all text nodes except the last two (phone and fax) and join them with a ' ' separator
   target.write(' '.join(text.strip() for text in td[1].findAll(text=True)[:-2]))
   target.write("\n")

Upvotes: 1

Derek Litz

Reputation: 10897

In [243]: soup.getText(' ').replace('&nbsp;', ' ').strip()
Out[243]: u'Smith, Jane (#72374)  Suite 042 21 Jump St Toronto ON  M4C 5T2 Phone: (555) 555-5555 Fax: (555) 555-555'

To get exactly what you want:

In [246]: address = soup.getText(' ').replace('&nbsp;', ' ').strip()
In [247]: import re
In [248]: address = re.sub(r' Phone.*$', '', address)
In [249]: address = address.replace('  ', ' ')
In [250]: address = re.sub(r' \(.*?\)', '', address)
In [251]: print address
Smith, Jane Suite 042 21 Jump St Toronto ON M4C 5T2
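The same steps can be wrapped into one function so they can be applied to each row. This is only a sketch, assuming Python 3; `clean_record` is a hypothetical name, and the patterns are slightly narrower than the ones in the session above (the ID pattern only matches "(#digits)", and whitespace is collapsed in one pass instead of a single replace).

```python
import re


def clean_record(text):
    """Reduce one row's text to 'name address' (hypothetical helper)."""
    # Treat both the literal entity and the decoded non-breaking space as a space.
    text = text.replace('&nbsp;', ' ').replace('\xa0', ' ')
    # Drop everything from 'Phone:' to the end (this also removes the fax).
    text = re.sub(r'\s*Phone:.*$', '', text, flags=re.DOTALL)
    # Drop the '(#12345)' registration number that follows the name.
    text = re.sub(r'\s*\(#\d+\)', '', text)
    # Collapse any remaining runs of whitespace into single spaces.
    return ' '.join(text.split())


raw = ('Smith, Jane (#72374)  Suite 042 21 Jump St Toronto ON  M4C 5T2 '
       'Phone: (555) 555-5555 Fax: (555) 555-555')
print(clean_record(raw))
```

Note that this relies on every row having a "Phone:" label before the fax number; a row with only a fax line would keep it.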

Upvotes: 1
