Reputation: 692
So I'm scraping some content and I'm trying to strip out the HTML tags with BeautifulSoup in Python, but leave the content. For example, given:
<p>Hello, how <b>are</b> you</p>
I would want output:
Hello, how are you
Normally, I would use the get_text method. The problem is that apparently some of the pages I'm scraping have HTML errors in them. For example:
<p>Hello, how </b><b>are</b> you</p>
When this happens, get_text() winds up stripping out big sections of the text I want. I tried doing this with regex instead and wound up with the same problem:
description = re.sub("<.[^/<>]*>", "", str(description))
description = re.sub("</.[^/<>]*>", "", str(description))
Does anyone know a way around this issue? Thanks in advance.
Upvotes: 0
Views: 556
Reputation: 1125368
BeautifulSoup trees represent every element as an object; by the time the tree exists, the parser has already interpreted the broken HTML, so you cannot use regular expressions to 'fix' it after the tree has been built.
BeautifulSoup leaves it to a parser to build the tree, and it is up to the parser to decide how to handle broken HTML. Different parsers handle broken HTML differently.
You should try different parsers with your input to see how each of them handles it. The standard html.parser option handles broken HTML less well than the alternatives, while the html5lib option comes closest to how a modern browser would repair broken HTML, albeit more slowly than lxml parses it.
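As a minimal sketch of that comparison (lxml and html5lib are optional third-party packages and may not be installed, so the loop simply skips any that are missing), you could parse the broken snippet from the question with each parser and compare the repaired tree and the extracted text:
from bs4 import BeautifulSoup

broken = "<p>Hello, how </b><b>are</b> you</p>"

# The repaired tree, and therefore get_text(), can differ between parsers
# when the markup is broken.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
    except Exception:
        # lxml and html5lib are optional packages; skip any that are absent.
        print(f"{parser}: not installed")
        continue
    print(f"{parser}: tree={str(soup)!r} text={soup.get_text()!r}")
Whichever parser reproduces the text you expect is the one to pass as the second argument to BeautifulSoup when you scrape the real pages.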
Upvotes: 1