Reputation: 692
So I'm scraping some content and I'm trying to strip out the HTML tags with BeautifulSoup in Python, but leave the content. For example, given:
<p>Hello, how <b>are</b> you</p>
I would want output:
Hello, how are you
Normally, I would use the get_text method. The problem is that apparently some of the pages I'm scraping have HTML errors in them. For example:
<p>Hello, how </b><b>are</b> you</p>
When this happens, get_text() winds up stripping out big sections of the text I want. I tried doing this with regex instead and wound up with the same problem:
description = re.sub("<.[^/<>]*>", "", str(description))
description = re.sub("</.[^/<>]*>", "", str(description))
Does anyone know a way around this issue? Thanks in advance.
Upvotes: 0
Views: 556
Reputation: 1125368
BeautifulSoup trees represent every element as an object; by the time the tree exists, the parser has already interpreted the broken HTML, so you cannot use regular expressions to 'fix' it after the tree has been built.
BeautifulSoup leaves it to a parser to build the tree, and it is up to the parser to decide how to handle broken HTML. Different parsers handle broken HTML differently.
You should try different parsers with your input to see how each of them handles it. The standard html.parser option handles broken HTML less well than the alternatives, while the html5lib option comes closest to how a modern browser would repair broken HTML, albeit more slowly than lxml parses it.
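As a minimal sketch of that comparison (lxml and html5lib are optional third-party packages and may not be installed, so the loop simply skips any that are missing), you could parse the broken snippet from the question with each parser and compare the repaired tree and the extracted text:
from bs4 import BeautifulSoup

broken = "<p>Hello, how </b><b>are</b> you</p>"

# The repaired tree, and therefore get_text(), can differ between parsers
# when the markup is broken.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
    except Exception:
        # lxml and html5lib are optional packages; skip any that are absent.
        print(f"{parser}: not installed")
        continue
    print(f"{parser}: tree={str(soup)!r} text={soup.get_text()!r}")
Whichever parser reproduces the text you expect is the one to pass as the second argument to BeautifulSoup when you scrape the real pages.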
Upvotes: 1