Reputation: 1452
I'm using XPath with Scrapy to scrape data off of the movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one XPath string.
Depending on the movie web page from which I'm scraping data, the data I need is sometimes located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14,000 movies, so this process needs to be automated.
Using this as an example, I will need the actor(s), director(s), and producer(s).
This is the XPath to the director (note: the %s corresponds to a determined index where that information is found; in the Action Jackson example the director is found at [1] and the actors at [2]):
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link to a page about the director exists, this would be the XPath:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit trickier, as there are <br> elements included for subsequent actors listed, which may be children of an /a or children of the parent /font, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
gets almost all of the actors (except those after a font/br).
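To make the "one XPath string" idea concrete, here is roughly what I am after, using the same locActor index as in my code further down (a sketch only; the | union operator is standard XPath, but I have not verified that this covers every page layout):

# Sketch: cover both the "with link" and "without link" cases in one
# expression via the XPath union operator "|"
names = hxs.select(
    '//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()'
    ' | //div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()'
    % (locActor, locActor)).extract()

# Or grab every descendant text node of the <font>, inside or outside <a>
names = hxs.select(
    '//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//text()'
    % (locActor,)).extract()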
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements - everything I have works EXCEPT that I also end up getting some digits from the other mp_box_content divs. I have also added numerous try:/except: statements in order to get everything (actors, directors, and producers who either do or do not have links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover for the facts that the first actor may not have a link associated with him/her while subsequent actors do, or that the first actor may have a link while the rest do not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
Upvotes: 1
Views: 4207
Reputation: 20748
I am assuming you are only interested in the textual content, not in the links to the actors' pages etc.
Here is a suggestion using lxml.html (and a bit of lxml.etree) directly.
First, I recommend you select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for both "Director" and "Directors".
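For instance, a quick sketch with Scrapy's HtmlXPathSelector (the div selector is the one from your question; I have not run this against every page):

# Sketch: pick the value cell by the label cell's text, so both
# "Director" and "Directors" rows match
director_texts = hxs.select(
    '//div[@class="mp_box_content"]'
    '//tr[starts-with(td[1], "Director")]/td[2]//text()').extract()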
Second, testing various expressions with or without <font>, with or without <a>, etc. makes the code difficult to read and maintain. Since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
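To illustrate both options outside of a spider (a sketch only, assuming response.body holds the page HTML):

import lxml.html

root = lxml.html.fromstring(response.body)

# Option 1: XPath's string() flattens the cell to its text content
director = root.xpath(
    'string(//div[@class="mp_box_content"]'
    '//tr[starts-with(td[1], "Director")]/td[2])')

# Option 2: select the element, then serialize only its text
cells = root.xpath(
    '//div[@class="mp_box_content"]'
    '//tr[starts-with(td[1], "Actor")]/td[2]')
if cells:
    actors_text = lxml.html.tostring(cells[0], method="text", encoding=unicode)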
And for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content to add a special formatting character to the <br> elements' .text or .tail, for example a \n, with one of lxml's iter() functions. This can be useful for other HTML block elements too, like <hr> for example.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:

            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                    category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                     category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers

            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
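If it helps, a minimal sketch of the "populate scrapy items" part; MovieItem and its field names are just placeholders, not something Scrapy or the code above defines for you:

from scrapy.item import Item, Field

class MovieItem(Item):
    # placeholder item definition - adjust field names to your project
    directors = Field()
    actors = Field()
    producers = Field()
    writers = Field()
    composers = Field()

Then in parse(), instead of the print statements, you could build and yield a MovieItem(directors=directors, actors=actors, producers=producers, writers=writers, composers=composers).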
You can do similar things with HtmlXPathSelector of course (with the string() XPath function for example), but without modifying the tree for <br> (how would you do that with hxs?) it works only for non-multiple names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']
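One workaround with HtmlXPathSelector, if post-processing is acceptable, is to extract the individual text nodes instead of the string value; since text nodes are split at <a> and <br> boundaries, multiple names come back as separate list items (a sketch, stripping whitespace-only entries):

# Sketch: individual text nodes are split at <a> and <br> boundaries,
# so multiple names come back as separate strings
raw = hxs.select(
    '//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]'
    '/div[@class="mp_box_content"]/table'
    '//tr[contains(td, "Actor")]/td[2]//text()').extract()
actors = [name.strip() for name in raw if name.strip()]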
Upvotes: 3