Thanizer

Reputation: 402

BeautifulSoup drops text when fixing up broken markup

I'm pretty new to Python, but what the heck... This is kind of a weird question, so I will do my best to explain it as thoroughly as I can:

I'm writing a Python script that checks a webpage for a specific change (basically a number flipping from 0 to 1). When that change occurs, the script will go on to do something else. Unfortunately, I haven't gotten that far, because I can't even parse the HTML: a lot of it is missing by the time BeautifulSoup gets hold of it! (At least, that is my claim.)

Let's step through this. I'm using BeautifulSoup and Mechanize. First, I find a form on the webpage and select it, changing controls in the form as needed. (I have verified that all of the controls change as I expect.) Then I submit the form and call a helper function I wrote called process_results():

...
form = list(client.forms())[1]
client.select_form('ttform')
...
# Modify controls
...
client.submit()
process_results(client)

process_results() just checks what the client got back. First of all, depending on what was put into the form, you can get invalid search results, so I would like to search for the error message that displays on the webpage and see if it exists. I use BeautifulSoup to do this:

# Processes search results.
def process_results(cli):

    html = cli.response().read()
    soup = BeautifulSoup(html)
    ...

The check for whether that error markup appears on the page looks like this:

...
if soup.find('td', attrs={'class': 'msgarea'}) is not None:
    # Do something...
    ...

This never evaluates to true because the tag I'm describing cannot be found. I decided to print out the response both directly from Mechanize and from BeautifulSoup, and this is what I got (shortened):

Mechanize prints the markup I'm looking for, which means that the response is coming back correctly:

...
<TD class=msgarea>
<B class=important_msg>There was a problem with your request:</B>
<BR>
<BR>
<li class=red_msg>...</li>
...
</TD></TR></TABLE><P></DIV>
...

This is the last piece of HTML that shows up from BeautifulSoup:

...
<span class="pageheaderlinks">
<a ... > MENU </a>
|
<a ... > SITE MAP </a>
|
</span></td></tr></table></div></body></html>

In fact, here's that same HTML from Mechanize:

...
<SPAN class="pageheaderlinks">
<A ... >MENU</A>
|
<A ... >SITE MAP</A>
|
<!-- Notice how this continues -->
<A ... >HELP</A>
|
<A ... >EXIT</A>
</span>
...

The problem seems to be that BeautifulSoup omits a large piece of HTML from the end of what Mechanize's Browser reports. This could be a problem with how I'm going about things, but at this point, I'm incredibly lost.

Does anyone know what could be causing this to occur? Thanks! :)

Upvotes: 2

Views: 1011

Answers (2)

Blender

Reputation: 298176

BeautifulSoup supports a bunch of different HTML parsers. Python's built-in parser isn't very fast or lenient (meaning that it has a hard time making sense of invalid HTML), so it chokes on your HTML.

Try installing lxml, which is more lenient and much faster. If that doesn't work, html5lib is your best bet, as it's the most lenient but also the slowest.
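
As a rough sketch of that advice, assuming Beautiful Soup 4 (the `bs4` package) is installed, the parser can be named explicitly when building the soup. The helper `make_soup` and the constant `PARSER_PREFERENCE` are hypothetical names made up for this example. The sketch tries lxml, then html5lib, then the standard-library parser, and skips any that are not installed. The HTML fragment is the one from the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Order of preference from the answer above: lxml first, html5lib second,
# and the standard-library parser as a last resort (always available).
PARSER_PREFERENCE = ("lxml", "html5lib", "html.parser")


def make_soup(html, parsers=PARSER_PREFERENCE):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Fragment copied from the Mechanize output in the question.
html = """
<TD class=msgarea>
<B class=important_msg>There was a problem with your request:</B>
<BR>
<BR>
<li class=red_msg>...</li>
</TD></TR></TABLE><P></DIV>
"""

soup, parser_used = make_soup(html)
message = soup.find("b", attrs={"class": "important_msg"})
print("parser:", parser_used)
print("found:", message is not None)
```

Different parsers repair broken markup differently, so the resulting tree may not be identical from one parser to the next. If a lookup still fails, it is worth printing `soup.prettify()` to see what the chosen parser actually produced.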

Upvotes: 7

msw

Reputation: 43487

Blender's answer was correct, but this code shows how badly the old parser ruins the markup and may prove useful when hunting down similar problems.

# fails with bs3, works with bs4
bs3 = True

if bs3:
    from BeautifulSoup import BeautifulSoup 
else:
    from bs4 import BeautifulSoup 

mechanize = """
    <TD class=msgarea>
    <B class=important_msg>There was a problem with your request:</B>
    <BR>
    <BR>
    <li class=red_msg>...</li>
    </TD></TR></TABLE><P></DIV>"""


soup = BeautifulSoup(mechanize) 
# the default parser worked just fine, see?
print soup.prettify()

print 'is important_msg?', soup.find('b').attrs
print 'is msgarea?', soup.find('td').attrs
print 'is td?', soup.find(class_='msgarea').name
print 'is contents?', soup.find('td', class_='msgarea').contents[:5], '...'

It took me a while to debug because bs4 wasn't failing, so I figured I'd perhaps save the next guy to come by here some time. This is the truly bizarre output using bs3, which can find the tag by name but not by class:

<td class="msgarea">
 <b class="important_msg">
  There was a problem with your request:
 </b>
 <br />
 <br />
 <li class="red_msg">
  ...
 </li>
</td>
<p>
</p>
is important_msg? [(u'class', u'important_msg')]
is msgarea? [(u'class', u'msgarea')]
is td?
Traceback (most recent call last):
  File "bs-fail.py", line 24, in <module>
    print 'is td?', soup.find(class_='msgarea').name
AttributeError: 'NoneType' object has no attribute 'name'
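
When hunting down this kind of problem, it can also help to check mechanically which pieces of the raw response survive parsing. The following is only a sketch using the standard library. `missing_after_parse` is a hypothetical helper, and the two strings are abbreviated stand-ins for the Mechanize and BeautifulSoup outputs in the question, not the real page.

```python
def missing_after_parse(raw_html, parsed_html, markers):
    """Return the markers present in raw_html but absent from parsed_html.

    The comparison is case-insensitive because parsers commonly lowercase
    tag and attribute names.
    """
    raw = raw_html.lower()
    parsed = parsed_html.lower()
    return [m for m in markers if m.lower() in raw and m.lower() not in parsed]


# Simplified stand-ins (assumptions) for the raw response and the parsed output.
raw_html = (
    '<SPAN class="pageheaderlinks">MENU | SITE MAP | HELP | EXIT</SPAN>'
    "<TD class=msgarea>...</TD>"
)
parsed_html = '<span class="pageheaderlinks">MENU | SITE MAP |</span>'

lost = missing_after_parse(raw_html, parsed_html, ["pageheaderlinks", "HELP", "msgarea"])
print(lost)  # ['HELP', 'msgarea']
```

In a real script the first argument would be the raw response body and the second the serialized soup. Any marker that comes back in the list was dropped or mangled by the parser rather than missing from the server's response.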

Upvotes: 0
