Reputation: 2779
Here is a simple piece of code in Python 2.7.2 that fetches a site and gets all the links from it:
import urllib2
from bs4 import BeautifulSoup
def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")
links1 = getAllLinks('http://www.stanford.edu')
links2 = getAllLinks('http://med.stanford.edu/')
print len(links1)
print len(links2)
The problem is that it doesn't work in the second case. It prints 102 and 0, while there are clearly links on the second site. BeautifulSoup doesn't throw any parsing errors and it pretty-prints the markup fine. I suspect it may be caused by the first line of the med.stanford.edu source, which declares the document to be XML (even though the Content-Type is text/html):
<?xml version="1.0" encoding="iso-8859-1"?>
I can't figure out how to set up Beautiful Soup to disregard it, or find a workaround. I'm using html5lib as the parser because I had problems with the default one (incorrect markup).
Upvotes: 2
Views: 1724
Reputation: 4164
When a document claims to be XML, I find the lxml parser gives the best results. Trying your code but using the lxml parser instead of html5lib finds the 300 links.
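For example, a minimal sketch of the same getAllLinks function with only the parser argument changed (this assumes the lxml library is installed):

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    # "lxml" in place of "html5lib"; lxml copes with the leading <?xml ...?> declaration
    soup = BeautifulSoup(content, "lxml")
    return soup.find_all("a")

links2 = getAllLinks('http://med.stanford.edu/')
print len(links2)  # 300 with the lxml parser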
Upvotes: 3
Reputation: 78660
You are precisely right that the problem is the <?xml ... line. Disregarding it is very simple: just skip the first line of content by replacing
content = response.read()
with something like
content = "\n".join(response.readlines()[1:])
Upon this change, len(links2) becomes 300.
ETA: You probably want to do this conditionally, so you don't always skip the first line of content. An example would be something like:
content = response.read()
if content.startswith("<?xml"):
content = "\n".join(content.split("\n")[1:])
Upvotes: 2