sayth

Reputation: 7048

Advice extracting //td text and numbers

I have been working through the tutorial, adapting it to a project of my own, but something is going wrong and I just can't find the error.

When using 'scrapy shell' I get the response I expect. So for this site, the NRL Ladder:

In [1]: hxs.select('//td').extract()
Out[1]: 
[u'<td>\r\n<div id="ls-nav">\r\n<ul><li><a href="http://www.nrlstats.com/"><span>Home</span></a></li>\r\n<li class="ls-nav-on"><a href="/nrl"><span>NRL</span></a></li>\r\n<li><a href="/nyc"><span>NYC</span></a></li>\r\n<li><a href="/rep"><span>Rep Matches</span></a></li>\r\n\r\n</ul></div>\r\n</td>',
 u'<td style="text-align:left" colspan="5">Round 4</td>',
 u'<td colspan="5">Updated: 26/3/2012</td>',
 u'<td style="text-align:left">1. Melbourne</td>',
 u'<td>4</td>',
 u'<td>4</td>',
 u'<td>0</td>',
 u'<td>0</td>',
 u'<td>0</td>',
 u'<td>122</td>',
 u'<td>39</td>',
 u'<td>83</td>',
 u'<td>8</td>',
 u'<td style="text-align:left">2. Canterbury-Bankstown</td>',

And on it goes.

I am really struggling to understand how to adapt the tutorial project to handle a different kind of data.

Is there any way to bring up help or a documentation list showing what types I should use in items for 'td' or any other element? Like I say, it works easily in the shell, but I cannot carry it over into the project files. Specifically, both the team names and the points are 'td' cells, but the team name is text.
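To show what I mean in plain Python (these helpers are just my own illustration, not Scrapy code): the team cells look like '1. Melbourne' with a rank glued on, while the points cells are plain digits.

```python
import re

# Plain-Python illustration of the split I'm after (not Scrapy API):
# team cells look like "1. Melbourne", points cells are plain digits.
def parse_team_cell(text):
    """Drop the leading "N. " rank from a team cell."""
    return re.sub(r'^\d+\.\s*', '', text)

def parse_points_cell(text):
    """Points cells hold plain integers."""
    return int(text)

print(parse_team_cell('1. Melbourne'))  # Melbourne
print(parse_points_cell('122'))         # 122
```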

Here is what I have done:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from nrl.items import NrlItem

class nrl(BaseSpider):
    name = "nrl"
    allowed_domains = ["http://live.nrlstats.com/"]
    start_urls = [
        "http://live.nrlstats.com/nrl/ladder.html",
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//td')
        items = []
        for site in sites:
           item = nrlItem()
           item['team'] = site.select('/text()').extract()
           item['points'] = site.select('/').extract()
           items.append(item)
        return items

Upvotes: 0

Views: 2575

Answers (1)

warvariuc

Reputation: 59604

I didn't quite understand your question, but here is a starting point, imo (haven't tested; see some comments in the code):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from nrl.items import NrlItem

class nrl(BaseSpider):
    name = "nrl"
    allowed_domains = ["live.nrlstats.com"] # domains should be like this
    start_urls = [
        "http://live.nrlstats.com/nrl/ladder.html",
        ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tabler"]//tr[starts-with(@class, "r")]') # select team rows
        items = []
        for row in rows:
            item = NrlItem()  # note: NrlItem, as imported, not nrlItem
            columns = row.select('./td/text()').extract() # text of each column in the selected row
            item['team'] = columns[0]
            item['P'] = int(columns[1])
            item['W'] = int(columns[2])
            ...
            items.append(item)
        return items

UPDATE:

//table[@class="tabler"]//tr[starts-with(@class, "r")] is an xpath query. See some xpath examples here.

hxs.select(xpath_query) always returns a list of nodes (also of type HtmlXPathSelector) which match the given query.

extract() returns the string representation of the selected node(s).
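A stdlib-only sketch (no Scrapy needed) of what row.select('./td/text()').extract() yields for one ladder row; the sample markup is modeled on the shell output in the question:

```python
from html.parser import HTMLParser

# Collect the text inside each <td>, mimicking './td/text()' + extract().
class TdText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.texts.append(data.strip())

row = '<tr class="r1"><td>1. Melbourne</td><td>4</td><td>4</td><td>8</td></tr>'
parser = TdText()
parser.feed(row)
print(parser.texts)  # ['1. Melbourne', '4', '4', '8']
```

So columns[0] is the team name string, and the remaining columns can be fed to int() as in the spider above.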

P.S. Beware that Scrapy supports XPath 1.0, but not 2.0 (at least on Linux, not sure about Windows), so some of the newest xpath features might not work.
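For example, XPath 2.0's lower-case() function is unavailable; the usual XPath 1.0 workaround is translate() with explicit alphabets. The query strings below are for illustration only:

```python
# lower-case() exists only in XPath 2.0, so this query fails under 1.0:
xpath_2_only = '//td[lower-case(text()) = "melbourne"]'

# The XPath 1.0 idiom spells out both alphabets with translate():
xpath_1_safe = ('//td[translate(text(), '
                '"ABCDEFGHIJKLMNOPQRSTUVWXYZ", '
                '"abcdefghijklmnopqrstuvwxyz") = "melbourne"]')
print(xpath_1_safe)
```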


Upvotes: 2
