BeautifulSoup inconsistent behavior

Question

I am completely puzzled by the behavior of the following HTML-scraping code, which produces different results in two different environments, and I need help finding the root cause of the discrepancy.

import sys
import bs4
import md5
import logging
from urllib2 import urlopen
from platform import platform

# Log particulars of the environment
logging.warning("OS platform is %s" %platform())
logging.warning("Python version is %s" %sys.version)
logging.warning("BeautifulSoup is at %s and its version is %s" %(bs4.__file__, bs4.__version__))

# Open web-page and read HTML
url = 'http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JXIG&size=all'
response = urlopen(url)
html = response.read()

# Calculate MD5 to ensure that the same string was downloaded
print "MD5 sum for html string downloaded is %s" %md5.new(html).hexdigest()

# Make beautiful soup
soup = bs4.BeautifulSoup(html, 'html')
contigsTable = soup.find("table", {"class" : "zebra"})
contigs = []

# Parse table in soup to find all records
for row in contigsTable.findAll('tr'):
    column = row.findAll('td')
    if len(column) > 2:
        contigs.append(column[1])

# Expect identical results on any machine that this is run
print "Number of contigs identified is %s" %len(contigs)

On machine 1, it returns:

WARNING:root:OS platform is Linux-3.10.10-031010-generic-x86_64-with-Ubuntu-12.04-precise   
WARNING:root:Python version is 2.7.3 (default, Jun 22 2015, 19:33:41)  
[GCC 4.6.3]  
WARNING:root:BeautifulSoup is at /usr/local/lib/python2.7/dist-packages/bs4/__init__.pyc and its version is 4.3.2  
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf  

Number of contigs identified is 630

On machine 2, the identical code returns:

WARNING:root:OS platform is Linux-2.6.32-431.46.2.el6.nersc.x86_64-x86_64-with-debian-6.0.6
WARNING:root:Python version is 2.7.4 (default, Apr 17 2013, 10:26:13) 
[GCC 4.6.3]
WARNING:root:BeautifulSoup is at /global/homes/i/img/.local/lib/python2.7/site-packages/bs4/__init__.pyc and its version is 4.3.2
MD5 sum for html string downloaded is ca76b381df706a2d6443dd76c9d27adf

Number of contigs identified is 462

The number of contigs calculated is different. Note that the same code, parsing the same HTML (the MD5 sums match), yields different results in two environments that are not strikingly different from each other, and this has unfortunately turned into a production nightmare. Manual inspection confirms that the results returned on machine 2 are incorrect, but I have so far been unable to explain why.

Has anyone had a similar experience? Do you notice anything wrong with this code, or should I stop trusting BeautifulSoup altogether?

alecxe · Accepted Answer

You are seeing the differences between the parsers that BeautifulSoup chooses automatically for the "html" markup type you've specified. Which parser gets picked depends on which modules are available in the current Python environment:

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.
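A quick way to compare two machines is to check which of these parser modules can actually be imported on each. The following is only a sketch: it assumes Python 3 (the question uses Python 2), the helper name `installed_parsers` is made up for illustration, and the ordering simply mirrors the ranking quoted above rather than BeautifulSoup's internal lookup.

```python
import importlib.util

# (parser name passed to BeautifulSoup, module that must be importable),
# listed in the order the quoted documentation ranks them.
PARSER_MODULES = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # ships with Python itself
]


def installed_parsers():
    """Return the parser names whose backing module is importable here."""
    found = []
    for parser_name, module_name in PARSER_MODULES:
        if importlib.util.find_spec(module_name) is not None:
            found.append(parser_name)
    return found


print(installed_parsers())
```

If the two machines print different lists, that alone is enough to explain why a generic "html" request can end up with a different parser on each.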

To get consistent behavior across platforms, name the parser explicitly:

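# Pick one parser and use the same one on every machine:
soup = BeautifulSoup(html, "html.parser")  # built into Python
soup = BeautifulSoup(html, "html5lib")     # requires the html5lib package
soup = BeautifulSoup(html, "lxml")         # requires the lxml package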
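As an illustration, the row-counting loop from the question could be wrapped in a function that always receives the parser name explicitly. This is a minimal sketch under stated assumptions: Python 3 with `bs4` installed, a hypothetical helper name `count_contigs`, and a tiny made-up table standing in for the real NCBI page.

```python
from bs4 import BeautifulSoup

# Made-up sample data, only here to make the sketch self-contained.
SAMPLE_HTML = """
<table class="zebra">
  <tr><th>#</th><th>Contig</th><th>Length</th></tr>
  <tr><td>1</td><td>contig_A</td><td>100</td></tr>
  <tr><td>2</td><td>contig_B</td><td>200</td></tr>
</table>
"""


def count_contigs(html, parser="html.parser"):
    """Count table rows with more than two <td> cells, using a named parser."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", {"class": "zebra"})
    if table is None:
        return 0
    contigs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) > 2:  # skips the header row, which only has <th> cells
            contigs.append(cells[1])
    return len(contigs)


print(count_contigs(SAMPLE_HTML))  # 2
```

Passing "lxml" or "html5lib" instead only works where those packages are installed; otherwise BeautifulSoup raises an error rather than silently falling back, which is the behavior you want in production.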
