Reputation: 2779
Here is a simple piece of code in Python 2.7.2 that fetches a site and gets all the links from it:
import urllib2
from bs4 import BeautifulSoup
def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")
links1 = getAllLinks('http://www.stanford.edu')
links2 = getAllLinks('http://med.stanford.edu/')
print len(links1)
print len(links2)
The problem is that it doesn't work in the second case. It prints 102 and 0, while there are clearly links on the second site. BeautifulSoup doesn't throw any parsing errors and it pretty-prints the markup fine. I suspect it may be caused by the first line of the med.stanford.edu source, which declares the document to be XML (even though the Content-Type is text/html):
<?xml version="1.0" encoding="iso-8859-1"?>
I can't figure out how to set up Beautiful Soup to disregard it, or find a workaround. I'm using html5lib as the parser because I had problems with the default one (incorrect markup).
Upvotes: 2
Views: 1724
Reputation: 4164
When a document claims to be XML, I find the lxml parser gives the best results. Trying your code but using the lxml parser instead of html5lib finds the 300 links.
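For example, a minimal sketch of the same getAllLinks function with only the parser argument changed (this assumes the lxml library is installed):

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    # "lxml" in place of "html5lib"; lxml copes with the leading <?xml ...?> declaration
    soup = BeautifulSoup(content, "lxml")
    return soup.find_all("a")

links2 = getAllLinks('http://med.stanford.edu/')
print len(links2)  # 300 with the lxml parser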
Upvotes: 3
Reputation: 78660
You are precisely right that the problem is the <?xml ... line. Disregarding it is very simple: just skip the first line of content by replacing
content = response.read()
with something like
content = "\n".join(response.readlines()[1:])
Upon this change, len(links2) becomes 300.
ETA: You probably want to do this conditionally, so you don't always skip the first line of content. An example would be something like:
content = response.read()
if content.startswith("<?xml"):
content = "\n".join(content.split("\n")[1:])
Upvotes: 2