brrrglund

Reputation: 51

Scrapy callback function, how to parse several pages?

I want to make a crawler that starts at a url (page1), and follows a link to a new page, page2. On page2 it should follow a link to page3. Then I want to scrape some data on page3.

However, I'm a noob at scraping and can't get the callback function to work. Here's my code:

class allabolagnewspider(CrawlSpider):
    name = "allabolagnewspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[1]'),
                           canonicalize=False),
             callback='parse_link1'),
    )

    def parse_link1(self, response):
        hxs = HtmlXPathSelector(response)
        return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract(), callback=self.parse_link2)

    def parse_link2(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            item = AllabolagnewItem()
            item['Byra'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Namn'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Gender'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            item['Alder'] = sel.xpath('/div[2]/table/tbody/tr[3]/td/h1').extract()
            yield item

However, when I run it I get the following error message: "TypeError: Request url must be str or unicode, got list:"

If I understand correctly, I'm messing up when I try to return my request in parse_link1. What should I do?

Edit:

Here's the working code (it still has a few issues, but the specific problem was solved):

class allabolagnewspider(CrawlSpider):
    name = "allabolagnewspider"
    # allowed_domains = ["byralistan.se"]
    start_urls = [
        "http://www.allabolag.se/5565794400/befattningar"
    ]

    rules = (
        Rule(LinkExtractor(allow="http://www.allabolag.se",
                           restrict_xpaths=('//*[@id="printContent"]//a[2]'),
                           canonicalize=False),
             callback='parse_link1'),
    )

    def parse_link1(self, response):
        for href in response.xpath('''//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a/@href''').extract():
            print "hey"
            yield Request(response.urljoin(href), callback=self.parse_link2)

    def parse_link2(self, response):
        for sel in response.xpath('//*[@id="printContent"]'):
            print "hey2"
            item = AllabolagnewItem()
            item['Byra'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            item['Namn'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            item['Gender'] = sel.xpath('./div[2]/table//tr[7]/td/table[1]//tr[1]/td/text()').extract()
            item['Alder'] = sel.xpath('./div[2]/table//tr[3]/td/h1/text()').extract()
            yield item

Upvotes: 0

Views: 211

Answers (2)

paul trmbrth

Reputation: 20748

In parse_link1, you're passing a list as url, the first argument of the Request constructor, while a single string is expected: .extract() on a SelectorList (the result of calling .xpath() on the hxs selector) returns a list of strings.

Use .extract_first() instead:

return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract_first(),
               callback=self.parse_link2)

Edit after OP's comment on

"TypeError: Request url must be str or unicode, got NoneType:"

This is due to a "too-conservative" XPath expression, probably copied from your browser's Inspect tools (I tested your XPath in Chrome and it does match on this example page).

The trouble is with .../table/tbody/tr/...: a <tbody> is rarely present in real HTML pages written by people, or even in templates (written by people). The HTML spec wants a <table> to contain a <tbody>, but in practice it's often omitted, and browsers cope fine: they inject the missing <tbody> element into the DOM to host the <tr> rows.

So, although it's not strictly equivalent XPath, it's usually fine to:

  • either omit tbody/ and use the table/tr pattern
  • or use table//tr

See it in action with scrapy shell:

$ scrapy shell http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b
>>>
>>> # with XPath from browser tool (I assume), you get nothing for the "real" downloaded HTML 
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a')
[]
>>>
>>> # or, omitting `tbody/`
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]

>>> # replacing "/table/tbody/" with "/table//" (tbody is added by browser to have "correct DOM tree")
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>>
>>> # suggestion: use the <img> tag after the <a> as predicate
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>> 

Also, you need:

  • to get the "href" attribute value (by adding /@href at the end of your XPath)
  • to build an absolute URL; response.urljoin() is a handy shortcut for this

Continuing in scrapy shell:

>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a/@href').extract_first()
u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>> response.urljoin(u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
u'http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>> 

In the end, your callback could become:

def parse_link1(self, response):
    # .extract() returns a list here, after .xpath()
    # so you can loop, even if you have 1 result
    #
    # XPaths can be multiline, it's easier to read for long expressions
    for href in response.xpath('''
        //*[@id="printContent"]
           /div[2]
            /table//tr[4]/td
             /table//tr/td[2]/a/@href''').extract():
        yield Request(response.urljoin(href),
                      callback=self.parse_link2)

Upvotes: 1

Djunzu

Reputation: 498

hxs.xpath(...).extract() returns a list, not a string. Either iterate over the list, yielding a request for each URL, or select the specific URL you want from the list.

After that, it will work only if the links on the page are absolute URLs. If they are relative, you need to build the absolute URL yourself.
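Both points can be sketched with the standard library's urljoin, which is essentially what Scrapy's response.urljoin wraps; the hrefs below are made up for illustration:

```python
from urllib.parse import urljoin

# Hypothetical hrefs, standing in for the list that .extract() returns
hrefs = [
    "/befattningshavare/abc",       # relative path
    "http://www.allabolag.se/xyz",  # already absolute
]

# The page the links were extracted from (the OP's start URL)
base = "http://www.allabolag.se/5565794400/befattningar"

# Iterate over the list, resolving each href against the page URL;
# urljoin leaves already-absolute URLs untouched.
absolute = [urljoin(base, href) for href in hrefs]

for url in absolute:
    print(url)
```

In a spider callback you would loop the same way, yielding a Request(response.urljoin(href), ...) per href instead of printing.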

Upvotes: 0
