Reputation: 4084
I'm trying to make a web scraper that will parse a web-page of publications and extract the authors. The skeletal structure of the web-page is the following:
<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr>
<td class="author">####I want whatever is located here ###</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
I've been trying to use BeautifulSoup and lxml thus far to accomplish this task, but I'm not sure how to handle the two div tags and td tag because they have attributes. In addition to this, I'm not sure whether I should rely more on BeautifulSoup or lxml or a combination of both. What should I do?
At the moment, my code looks like what is below:
import re
import urllib2,sys
import lxml
from lxml import etree
from lxml.html.soupparser import fromstring
from lxml.etree import tostring
from lxml.cssselect import CSSSelector
from BeautifulSoup import BeautifulSoup, NavigableString
address='http://www.example.com/'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
html=soup.prettify()
html=html.replace(' ', ' ')
html=html.replace('í','í')
root=fromstring(html)
I realize that a lot of the import statements may be redundant, but I just copied whatever I currently had in more source file.
EDIT: I suppose that I didn't make this quite clear, but I have multiple tags in page that I want to scrape.
Upvotes: 8
Views: 9944
Reputation: 158
The lxml library is now the standard for parsing html in python. The interface can seem awkward at first, but it is very serviceable for what it does.
You should let the libary handle the xml specialism, such as those escaped &entities;
import lxml.html
html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
<td class="author">####I want whatever is located here, eh? í ###</td>
</tr></tbody></table></div></div></body></html>"""
root = lxml.html.fromstring(html)
tds = root.cssselect("div#contents td.author")
print tds # gives [<Element td at 84ee2cc>]
print tds[0].text # what you want, including the 'í'
Upvotes: 4
Reputation: 645
or you could be using pyquery, since BeautifulSoup is not actively maintained anymore, see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
first, install pyquery with
easy_install pyquery
then your script could be as simple as
from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
allauthors = [ td.text() for td in d('td.author') ]
pyquery uses the css selector syntax familiar from jQuery which I find more intuitive than BeautifulSoup's. It uses lxml underneath, and is much faster than BeautifulSoup. But BeautifulSoup is pure python, and thus works on Google's app engine as well
Upvotes: 6
Reputation: 63747
BeautifulSoup is certainly the canonical HTML parser/processor. But if you have just this kind of snippet you need to match, instead of building a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of creating a larger search expression:
from pyparsing import makeHTMLTags, withAttribute, SkipTo
author_td, end_td = makeHTMLTags("td")
# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class","author")))
search = author_td + SkipTo(end_td)("body") + end_td
for match in search.searchString(html):
print match.body
Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>"
and "</tag>"
expressions. It also handles:
"<tag/>"
syntaxThese are the common pitfalls when considering using a regex for HTML scraping.
Upvotes: 1
Reputation: 882171
It's not clear to me from your question why you need to worry about the div
tags -- what about doing just:
soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string
On the HTML you give, running this emits exactly:
####I want whatever is located here ###
which appears to be what you want. Maybe you can specify better exactly what it is you need and this super-simple snippet doesn't do -- multiple td
tags all of class author
of which you need to consider (all? just some? which ones?), possibly missing any such tag (what do you want to do in that case), and the like. It's hard to infer what exactly are your specs, just from this simple example and overabundant code;-).
Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:
thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
print thetd.string
...i.e., not much harder at all!-)
Upvotes: 12