Cleaning up and removing tags with BeautifulSoup

Question

I have the following script so far:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2

br = Browser()
br.open("http://www.foo.com")

html = br.response().read()

soup = BeautifulSoup(html)
items = soup.findAll(id="info")

and it runs perfectly, and results in the following "items":


John Doe

123 Main Street

Phone:5551234

YES

However, I'd like to take items and clean it up to get just the text:

John Doe
123 Main Street
5551234

How can you remove such tags in BeautifulSoup and Python?

As always, thanks!

Peter Lyons · Accepted Answer

This will do it for this EXACT HTML. It isn't tolerant of any deviation in the markup, so you'll want to add quite a lot of bounds checking and None checking, but here are the nuts and bolts of getting your data into plain text.

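items = soup.findAll(id="info")
# Name: the first text node inside the <b> nested in the first <span>
print items[0].span.b.contents[0]
# Address: a bare text node among the element's direct children
print items[0].contents[3].strip()
# Phone: drop the "Phone:" label and keep what follows the first colon
print items[0].contents[5].strip().split(":", 1)[1]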
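
For something less tied to exact child positions, here is a rough sketch that walks the text nodes instead of indexing into contents. It assumes the newer bs4 package (Beautiful Soup 4) on Python 3 rather than the BeautifulSoup 3 import used in the question. SAMPLE_HTML and extract_fields are made-up names for illustration, and the markup is only a guess at what the page might look like, since the real HTML isn't shown here. Keeping the first three strings is also an assumption based on the desired output.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a guess at a structure that would yield the text
# shown in the question. The real page may differ.
SAMPLE_HTML = """
<div id="info">
<span class="customer"><b>John Doe</b></span><br><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>
"""


def extract_fields(html):
    """Return the name, address and phone number as a list of plain strings."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find(id="info")
    if info is None:
        # No element with id="info": return nothing instead of raising.
        return []

    # stripped_strings yields every text node under the element with
    # surrounding whitespace removed, skipping whitespace-only nodes.
    lines = list(info.stripped_strings)

    cleaned = []
    # Assumption: only the first three strings are wanted (the trailing
    # "YES" flag is dropped, as in the question's desired output).
    for line in lines[:3]:
        if line.startswith("Phone:"):
            # Keep only what follows the first colon.
            line = line.split(":", 1)[1].strip()
        cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    for field in extract_fields(SAMPLE_HTML):
        print(field)
```

This still depends on the order of the text on the page, but it won't break if extra tags or whitespace nodes show up between the fields, and it returns an empty list when the id="info" element is missing.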
