user1060859

Reputation: 27

Extracting links from HTML table using BeautifulSoup with unclean source code

I am trying to scrape articles from a Chinese newspaper database. Here is some of the source code (I am pasting an excerpt because the site requires a login):

<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web\" /><html>
<! -- <%@ page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>  
</html>

When I try to do some straightforward scraping of the links in the table like so:

import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')]
br.set_handle_robots(False)

url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(

BeautifulSoup does not find anything in the HTML and returns an empty list. I think this is because the source code starts with the base href tag, so the parser only recognizes two tags in the document: base and html.

Any idea how to scrape the links in this case? Thank you so much!!

Upvotes: 2

Views: 860

Answers (3)

jan zegan

Reputation: 1657

Removing the second line of the source (the malformed comment around the JSP page directive) made BeautifulSoup find all the tags. I didn't find a better way to parse this.

# Read the response body and strip the malformed line before parsing
html = br.open(url).read()
html = html.replace('<! -- <%@ page contentType="text/html;charset=GBK" %>', '')
soup = BeautifulSoup(html)

Upvotes: 1

malangi

Reputation: 2770

BeautifulSoup isn't really being developed any longer, so I would suggest you have a look at lxml.

I don't have access to that specific URL, but I was able to get this to work using the HTML fragment (to which I added an a tag):

>>> import lxml.html
>>> soup = lxml.html.document_fromstring(u)  # u holds the HTML fragment as a string
>>> soup.cssselect('a')
>>> soup.cssselect('a')[0].text_content()  # for example

Upvotes: 0

markijbema

Reputation: 4055

When your HTML is very messed up, it's better to clean it up a little first. In this case, for instance, remove everything before <html> and everything after the first </html>. Download one page, mold it manually to see what is acceptable to BeautifulSoup, and then write some regexes to preprocess.
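
A minimal sketch of that kind of preprocessing, assuming the page looks like the excerpt in the question. The names clean_html, LinkCollector, and sample are made up for illustration, and the sample markup (including the example.invalid URL and the single link) is invented. It is written for Python 3 and uses only the standard library so it can run on its own; the cleaned string could presumably be handed to BeautifulSoup instead of the small parser used here.

```python
import re
from html.parser import HTMLParser


def clean_html(raw):
    """Keep only the first <html>...</html> span and drop the stray JSP line."""
    # Discard anything before the first <html> tag (e.g. the leading <base>)
    start = re.search(r'<html\b', raw, re.IGNORECASE)
    if start:
        raw = raw[start.start():]
    # Discard anything after the first closing </html>
    end = re.search(r'</html\s*>', raw, re.IGNORECASE)
    if end:
        raw = raw[:end.end()]
    # Remove the malformed comment wrapped around the JSP page directive
    return re.sub(r'<!\s*--\s*<%@.*?%>', '', raw, flags=re.DOTALL)


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


# Invented sample that mimics the structure of the excerpt in the question
sample = (
    '<base href="http://example.invalid/web/" /><html>\n'
    '<! -- <%@ page contentType="text/html;charset=GBK" %>\n'
    '<head><title>t</title></head>\n'
    '<body><table><tr><td><a href="detail?record=1">one</a></td></tr></table></body>\n'
    '</html>\n'
    '</html>\n'
)

cleaned = clean_html(sample)
collector = LinkCollector()
collector.feed(cleaned)
print(collector.hrefs)
```

The regular expressions are deliberately loose about whitespace and case, since the real pages may differ slightly from the pasted excerpt; they would still need adjusting after inspecting an actual downloaded page.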

Upvotes: 0
