Reputation: 402
I'm pretty new to Python, but what the heck... This is kind of a weird question, so I will do my best to explain it as thoroughly as I can:
I'm writing a Python script that checks a webpage for a specific change (basically a number flipping from 0 to 1). When that change occurs, the script will go on to do something else. Unfortunately, I haven't gotten that far yet, because I'm having trouble even parsing the HTML: a lot of it is missing by the time BeautifulSoup gets hold of it! (At least, that is my claim.)
Let's step through this: I'm using BeautifulSoup and Mechanize for this. First, I find a form on the webpage and select it, changing controls in the form as needed. (I have verified that all of the controls change as I expect.) After this, I submit the form and then call a helper function I wrote called process_results():
...
form = list(client.forms())[1]
client.select_form('ttform')
...
# Modify controls
...
client.submit()
process_results(client)
process_results() just checks what the client got back. Depending on what was put into the form, the search can come back invalid, so I would like to check whether the error message that the webpage displays is present. I use BeautifulSoup to do this:
# Processes search results.
def process_results(cli):
    html = cli.response().read()
    soup = BeautifulSoup(html)
    ...
The statement that checks whether that piece of markup appears on the page looks like this:
...
if soup.find('td', attrs={'class': 'msgarea'}) is not None:
    # Do something...
...
This never evaluates to true, because it cannot find the tag I'm describing. I decided to print out the response both directly from Mechanize and from BeautifulSoup, and this is what I got (shortened):
Mechanize prints the code I'm looking for, which means that the response is coming back correctly:
...
<TD class=msgarea>
<B class=important_msg>There was a problem with your request:</B>
<BR>
<BR>
<li class=red_msg>...</li>
...
</TD></TR></TABLE><P></DIV>
...
This is the last piece of HTML that shows up from BeautifulSoup:
...
<span class="pageheaderlinks">
<a ... > MENU </a>
|
<a ... > SITE MAP </a>
|
</span></td></tr></table></div></body></html>
In fact, here's that same HTML from Mechanize:
...
<SPAN class="pageheaderlinks">
<A ... >MENU</A>
|
<A ... >SITE MAP</A>
|
<!-- Notice how this continues -->
<A ... >HELP</A>
|
<A ... >EXIT</A>
</span>
...
The problem seems to be that BeautifulSoup is omitting a large piece of HTML from the end of what Mechanize's Browser is reporting. This could be a problem with how I'm going about things, but at this point, I'm incredibly lost.
Does anyone know what could be causing this to occur? Thanks! :)
Upvotes: 2
Views: 1011
Reputation: 298176
BeautifulSoup supports a bunch of different HTML parsers. Python's builtin parser isn't very fast or lenient (meaning that it has a hard time making sense of invalid HTML), so it chokes on your HTML.
Try installing lxml, which is more lenient and much faster. If that doesn't work, html5lib is your best bet, as it's the most lenient but also the slowest.
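As a rough sketch of what that looks like in practice, assuming bs4 (BeautifulSoup 4) is installed: the parser is chosen by name in the second argument to the constructor. The make_soup helper and the HTML constant below are made up for illustration, and the enclosing <TABLE><TR> was added to the question's shortened snippet so that every parser sees a complete table (html5lib, following the HTML5 rules, discards a <td> that is not inside a table).

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened error-page markup from the question. The opening <TABLE><TR>
# is an assumption added here so the fragment forms a complete table.
HTML = """
<TABLE><TR>
<TD class=msgarea>
<B class=important_msg>There was a problem with your request:</B>
<BR>
<BR>
<li class=red_msg>...</li>
</TD></TR></TABLE><P></DIV>"""


def make_soup(html, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed.

    Returns a (soup, parser_name) pair. bs4 raises FeatureNotFound when
    the requested parser library is not available, so we move on to the
    next name in that case.
    """
    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            continue
    raise RuntimeError("no usable HTML parser found")


soup, parser_used = make_soup(HTML)
cell = soup.find("td", attrs={"class": "msgarea"})
print("parser:", parser_used)
print("error message present:", cell is not None)
```

Printing which parser was actually used is handy while debugging, since the same markup can produce a different tree under each one.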
Upvotes: 7
Reputation: 43487
Blender's answer was correct, but this code shows how badly the old parser ruins the markup and may prove useful when hunting down similar problems.
# fails with bs3, works with bs4
bs3 = True
if bs3:
    from BeautifulSoup import BeautifulSoup
else:
    from bs4 import BeautifulSoup
mechanize = """
<TD class=msgarea>
<B class=important_msg>There was a problem with your request:</B>
<BR>
<BR>
<li class=red_msg>...</li>
</TD></TR></TABLE><P></DIV>"""
soup = BeautifulSoup(mechanize)
# the default parser worked just fine, see?
print soup.prettify()
print 'is important_msg?', soup.find('b').attrs
print 'is msgarea?', soup.find('td').attrs
print 'is td?', soup.find(class_='msgarea').name
print 'is contents?', soup.find('td', class_='msgarea').contents[:5], '...'
It took me a while to debug because bs4 wasn't failing, so I figured I'd perhaps save the next person who comes by here some time. This is the truly bizarre output using bs3, which can find the tag by name (find('td')) but not by class (find(class_='msgarea')):
<td class="msgarea">
<b class="important_msg">
There was a problem with your request:
</b>
<br />
<br />
<li class="red_msg">
...
</li>
</td>
<p>
</p>
is important_msg? [(u'class', u'important_msg')]
is msgarea? [(u'class', u'msgarea')]
is td?
Traceback (most recent call last):
File "bs-fail.py", line 24, in <module>
print 'is td?', soup.find(class_='msgarea').name
AttributeError: 'NoneType' object has no attribute 'name'
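A possible explanation for that traceback, offered as a best guess rather than a confirmed diagnosis: bs3 treats extra keyword arguments as literal attribute names, so class_='msgarea' would look for an attribute actually called class_, while bs4 understands class_ as a stand-in for the reserved word class. The dictionary form of the lookup is accepted by both versions. The sketch below runs under bs4 with Python's built-in parser; the one-line html string is a trimmed, made-up variant of the markup above.

```python
from bs4 import BeautifulSoup

# Trimmed variant of the error markup, wrapped in a complete table.
html = (
    "<table><tr><td class=msgarea>"
    "<b class=important_msg>There was a problem with your request:</b>"
    "</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# Dictionary form: works in both BeautifulSoup 3 and BeautifulSoup 4.
by_dict = soup.find("td", {"class": "msgarea"})

# Keyword form: bs4 only ("class" is reserved in Python, hence "class_").
by_keyword = soup.find("td", class_="msgarea")

print("dictionary lookup found:", by_dict.name)
print("same tag either way:", by_dict is by_keyword)
```

If the code has to run against either version, sticking to the dictionary form avoids the difference altogether.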
Upvotes: 0