BeautifulSoup drops text when fixing up broken markup

Question

I'm pretty new to Python, but what the heck... This is kind of a weird question, so I will do my best to explain it as thoroughly as I can:

I'm trying to write a Python script that checks a webpage for a specific change (basically a number flipping from 0 to 1). When that change occurs, the script will go on to do something else. Unfortunately, I haven't gotten that far yet, because much of the HTML seems to be missing by the time BeautifulSoup gets hold of it. (At least, this is what I claim.)

Let's step through this: I'm using BeautifulSoup and Mechanize for this. First, I find a form on the webpage and select it, changing controls in the form as I need. (I have verified that all of the controls change as I expect.) After this, I submit the form and then call a helper function I wrote called process_results():

...
form = list(client.forms())[1]
client.select_form('ttform')
...
# Modify controls
...
client.submit()
process_results(client)

process_results() just checks what the client got back. First of all, depending on what was put into the form, you can get invalid search results, so I would like to search for the error message that displays on the webpage and see if it exists. I use BeautifulSoup to do this:

# Processes search results.
def process_results(cli):

    html = cli.response().read()
    soup = BeautifulSoup(html)
    ...

The check for whether the error message appears on the page looks like this:

...
if soup.find('td', attrs={'class': 'msgarea'}) is not None:
    # Do something...
    ...

This will never evaluate to be true because it cannot find the tag I'm describing. I decided to print out both the response directly from Mechanize and from BeautifulSoup, and this is what I got (shortened):

Mechanize prints the code I'm out to find, which means that the response is coming back correctly:

... There was a problem with your request: ... ...

...

This is the last piece of HTML that shows up from BeautifulSoup:

...

 MENU 
|
 SITE MAP 
|

In fact, here's that same HTML from Mechanize:

...

MENU
|
SITE MAP
|
<!-- Notice how this continues -->
HELP
|
EXIT

...

The problem is that it seems BeautifulSoup is omitting a large piece of HTML from the end of what Mechanize's Browser is reporting. This could be a problem with how I'm going about things, but at this point, I'm incredibly lost.
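One quick way to confirm a claim like this is to check whether a marker string that is present in the raw response still exists after parsing. The following is only a minimal sketch: it assumes the `bs4` package (BeautifulSoup 4), and both the helper name `marker_survives` and the sample markup are hypothetical stand-ins, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML returned by the browser object.
SAMPLE_HTML = "<html><body><p>MENU | SITE MAP | HELP | EXIT</p></body></html>"


def marker_survives(html, marker, parser="html.parser"):
    """Return (in_raw, in_parsed): whether `marker` appears in the raw
    HTML string and in the text BeautifulSoup extracts from it."""
    soup = BeautifulSoup(html, parser)
    return marker in html, marker in soup.get_text()


in_raw, in_parsed = marker_survives(SAMPLE_HTML, "EXIT")
print(in_raw, in_parsed)
```

If the first value is true and the second is false for the real response, the text is being lost during parsing rather than in the request itself.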

Does anyone know what could be causing this to occur? Thanks! :)

Blender · Accepted Answer

BeautifulSoup supports several different HTML parsers. Python's built-in parser isn't very fast or lenient (meaning that it has a hard time making sense of invalid HTML), so it chokes on your HTML.

Try installing lxml, which is more lenient and much faster. If that doesn't work, html5lib is your best bet, as it's the most lenient but also the slowest.
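As a rough sketch of that advice, the parser can be named explicitly as the second argument to `BeautifulSoup`. The example below assumes the `bs4` package; the helper `make_soup` and the sample markup are hypothetical and only illustrate the idea. It tries lxml first, then html5lib, and falls back to the built-in parser if neither is installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical broken markup (the first <tr> is never closed); not the real page.
BROKEN_HTML = """<html><body>
<table><tr><td class="menu">MENU | SITE MAP | HELP | EXIT</td>
<tr><td class="msgarea">There was a problem with your request</td></tr>
</table></body></html>"""


def make_soup(html, parsers=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # bs4 raises this when the requested parser is not available.
            continue
    raise RuntimeError("no usable HTML parser found")


soup, parser_used = make_soup(BROKEN_HTML)
error_cell = soup.find("td", attrs={"class": "msgarea"})
print(parser_used, error_cell is not None)
```

Naming the parser also keeps the behaviour the same across machines, since BeautifulSoup otherwise picks whichever parser happens to be installed.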
