brrrglund

Reputation: 51

Scrapy callback function, how to parse several pages?

I want to make a crawler that starts at a url (page1), and follows a link to a new page, page2. On page2 it should follow a link to page3. Then I want to scrape some data on page3.

However, I'm a noob at scraping and can't get the callback function to work. Here's my code:

class allabolagnewspider(CrawlSpider):
    name = "allabolagnewspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]'),
                           canonicalize=False),
             callback='parse_link1'),
    )

    def parse_link1(self, response):
        hxs = HtmlXPathSelector(response)
        return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract(), callback=self.parse_link2)

    def parse_link2(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagnewItem()
            item['Byra'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item

However, when I run it I get the following error message: "TypeError: Request url must be str or unicode, got list:"

If I understand correctly, I'm messing up when I try to return my request in parse_link1. What should I do?

Edit:

Here's the working code (it still has a few issues, but the specific problem was solved):

class allabolagnewspider(CrawlSpider):
    name = "allabolagnewspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[2]'),
                           canonicalize=False),
             callback='parse_link1'),
    )

    def parse_link1(self, response):
        for href in response.xpath('''//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a/@href''').extract():
            print "hey"
            yield Request(response.urljoin(href), callback=self.parse_link2)

    def parse_link2(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            print "hey2"
            item = AllabolagnewItem()
            item['Byra'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            item['Namn'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            item['Gender'] = sel.xpath('./div[2]/table//tr[7]/td/table[1]//tr[1]/td/text()').extract()
            item['Alder'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            yield item

Upvotes: 0

Views: 211

Answers (2)

paul trmbrth

Reputation: 20748

In parse_link1, you're passing a list as url, the first argument of the Request constructor, while a single string is expected: .extract() on a SelectorList (the result of calling .xpath() on the hxs selector) returns a list of strings.

Use .extract_first() instead:

return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract_first(),
               callback=self.parse_link2)

Edit after OP's comment on

"TypeError: Request url must be str or unicode, got NoneType:"

This is due to a "too-conservative" XPath expression, probably copied from your browser's Inspect tools (I tested your XPath in Chrome and it does match on this example page).

The trouble is with .../table/tbody/tr/...: a <tbody> is rarely present in real HTML pages written by people, or even in templates (written by people). The HTML spec wants a <table> to contain a <tbody>, but in practice it's often omitted, and browsers cope fine: they inject the missing <tbody> element into the DOM to host the <tr> rows.

So, although it's not strictly equivalent XPath, it's usually fine to:

  • either omit tbody/ and use the table/tr pattern
  • or use table//tr

See it in action with scrapy shell:

$ scrapy shell http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b
>>>
>>> # with XPath from browser tool (I assume), you get nothing for the "real" downloaded HTML 
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a')
[]
>>>
>>> # or, omitting `tbody/`
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]

>>> # replacing "/table/tbody/" with "/table//" (tbody is added by browser to have "correct DOM tree")
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>>
>>> # suggestion: use the <img> tag after the <a> as predicate
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>> 

Also, you need:

  • to get the "href" attribute value (by adding /@href at the end of your XPath)
  • to build an absolute URL; response.urljoin() is a handy shortcut for this

Continuing in scrapy shell:

>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a/@href').extract_first()
u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>> response.urljoin(u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
u'http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>> 

In the end, your callback could become:

def parse_link1(self, response):
    # .extract() returns a list here, after .xpath()
    # so you can loop, even if you have 1 result
    #
    # XPaths can be multiline, it's easier to read for long expressions
    for href in response.xpath('''
        //*[@id="printContent"]
           /div[2]
            /table//tr[4]/td
             /table//tr/td[2]/a/@href''').extract():
        yield Request(response.urljoin(href),
                      callback=self.parse_link2)

Upvotes: 1

Djunzu

Reputation: 498

hxs.xpath(...).extract() returns a list, not a string. Either iterate over the list, yielding a request for each URL, or select the specific URL you want from the list.

After that, it will work only if the links on the page are absolute URLs. If they are relative, you need to build the absolute URL yourself.
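Both points can be sketched with the standard library's urljoin, which is essentially what Scrapy's response.urljoin wraps; the hrefs below are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical hrefs, standing in for the list that .extract() returns
hrefs = [
    "/befattningshavare/abc",       # relative path
    "http://www.allabolag.se/xyz",  # already absolute
]

# The page the links were extracted from (the OP's start URL)
base = "http://www.allabolag.se/5565794400/befattningar"

# Iterate over the list, resolving each href against the page URL;
# urljoin leaves already-absolute URLs untouched.
absolute = [urljoin(base, href) for href in hrefs]

for url in absolute:
    print(url)
```

In a spider callback you would loop the same way, yielding a Request(response.urljoin(href), ...) per href instead of printing.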

Upvotes: 0
