Reputation: 1452
I'm using XPath with Scrapy to scrape data off of the movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one XPath string.
Depending on the movie web page from which I'm scraping data, the data I need is sometimes located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14,000 movies, so this process needs to be automated.
Using this as an example, I will need the actor(s), director(s), and producer(s).
This is the XPath to the director (note: the %s corresponds to a determined index where that information is found; in the Action Jackson example the director is found at [1] and the actors at [2]):
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link to a page about the director exists, this would be the XPath:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit trickier, as there are <br> elements included for subsequent actors listed, which may be children of an /a or children of the parent /font, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
gets almost all of the actors (except those after a font/br).
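To make the "one XPath string" idea concrete, here is roughly what I am after, using the same locActor index as in my code further down (a sketch only; the | union operator is standard XPath, but I have not verified that this covers every page layout):

# Sketch: cover both the "with link" and "without link" cases in one
# expression via the XPath union operator "|"
names = hxs.select(
    '//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()'
    ' | //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()'
    % (locActor, locActor)).extract()

# Or grab every descendant text node of the <font>, inside or outside <a>
names = hxs.select(
    '//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//text()'
    % (locActor,)).extract()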
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements - everything I have works EXCEPT that I also end up getting some digits from the other mp_box_content divs. I have also added numerous try:/except: statements in order to get everything (actors, directors, and producers who either do or do not have links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover for the facts that the first actor may not have a link associated with him/her while subsequent actors do, or that the first actor may have a link while the rest do not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
Upvotes: 1
Views: 4207
Reputation: 20748
I am assuming you are only interested in the textual content, not in the links to the actors' pages etc.
Here is a suggestion using lxml.html (and a bit of lxml.etree) directly.
First, I recommend you select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for both "Director" and "Directors".
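For instance, a quick sketch with Scrapy's HtmlXPathSelector (the div selector is the one from your question; I have not run this against every page):

# Sketch: pick the value cell by the label cell's text, so both
# "Director" and "Directors" rows match
director_texts = hxs.select(
    '//div[@class="mp_box_content"]'
    '//tr[starts-with(td[1], "Director")]/td[2]//text()').extract()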
Second, testing various expressions with or without <font>, with or without <a>, etc. makes the code difficult to read and maintain. Since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
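To illustrate both options outside of a spider (a sketch only, assuming response.body holds the page HTML):

import lxml.html

root = lxml.html.fromstring(response.body)

# Option 1: XPath's string() flattens the cell to its text content
director = root.xpath(
    'string(//div[@class="mp_box_content"]'
    '//tr[starts-with(td[1], "Director")]/td[2])')

# Option 2: select the element, then serialize only its text
cells = root.xpath(
    '//div[@class="mp_box_content"]'
    '//tr[starts-with(td[1], "Actor")]/td[2]')
if cells:
    actors_text = lxml.html.tostring(cells[0], method="text", encoding=unicode)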
And for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content to add a special formatting character to the <br> elements' .text or .tail, for example a \n, with one of lxml's iter() functions. This can be useful for other HTML block elements too, like <hr> for example.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:

            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                    category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                     category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers

            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
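If it helps, a minimal sketch of the "populate scrapy items" part; MovieItem and its field names are just placeholders, not something Scrapy or the code above defines for you:

from scrapy.item import Item, Field

class MovieItem(Item):
    # placeholder item definition - adjust field names to your project
    directors = Field()
    actors = Field()
    producers = Field()
    writers = Field()
    composers = Field()

Then in parse(), instead of the print statements, you could build and yield a MovieItem(directors=directors, actors=actors, producers=producers, writers=writers, composers=composers).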
You can do similar things with HtmlXPathSelector of course (with the string() XPath function for example), but without modifying the tree for <br> (how would you do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
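One workaround with HtmlXPathSelector, if post-processing is acceptable, is to extract the individual text nodes instead of the string value; since text nodes are split at <a> and <br> boundaries, multiple names come back as separate list items (a sketch, stripping whitespace-only entries):

# Sketch: individual text nodes are split at <a> and <br> boundaries,
# so multiple names come back as separate strings
raw = hxs.select(
    '//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]'
    '/div[@class="mp_box_content"]/table'
    '//tr[contains(td, "Actor")]/td[2]//text()').extract()
actors = [name.strip() for name in raw if name.strip()]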
Upvotes: 3