Scrapy only scraping first result of each page

Question

I'm currently trying to run the following code but it keeps scraping only the first result of each page. Any idea what the issue may be?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from firstproject.items import xyz123Item
import urlparse
from scrapy.http.request import Request

class MySpider(CrawlSpider):
    name = "xyz123"
    allowed_domains = ["www.xyz123.com.au"]
    start_urls = ["http://www.xyz123.com.au/",]

    rules = (Rule (SgmlLinkExtractor(allow=("",),restrict_xpaths=('//*[@id="1234headerPagination_hlNextLink"]',))
    , callback="parse_xyz", follow=True),
    )

    def parse_xyz(self, response):
        hxs = HtmlXPathSelector(response)
        xyz = hxs.select('//div[@id="1234SearchResults"]//div/h2')
        items = []
        for xyz in xyz:
            item = xyz123Item()
            item ["title"] = xyz.select('a/text()').extract()[0]
            item ["link"] = xyz.select('a/@href').extract()[0]
            items.append(item)
            return items

The Basespider version works well scraping ALL the required data on the first page:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from firstproject.items import xyz123

class MySpider(BaseSpider):
    name = "xyz123test"
    allowed_domains = ["xyz123.com.au"]
    start_urls = ["http://www.xyz123.com.au/"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//div[@id="1234SearchResults"]//div/h2')
        items = []
        for titles in titles:
            item = xyz123Item()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

Sorry for the censoring. I had to censor the website for privacy reasons.

The first code crawls through the pages well the way I'd like it to crawl, however it only pulls the first item title and link. NOTE: The XPath of the first title using "inspect element" in google is:
//*[@id="xyz123SearchResults"]/div[1]/h2/a,
second is //*[@id="xyz123SearchResults"]/div[2]/h2/a
third is //*[@id="xyz123SearchResults"]/div[3]/h2/a etc.

I'm not sure if the div[n] bit is what's killing it. I'm hoping it's an easy fix.

Thanks

Biswanath · Accepted Answer

 for xyz in xyz:
            item = xyz123Item()
            item ["title"] = xyz.select('a/text()').extract()[0]
            item ["link"] = xyz.select('a/@href').extract()[0]
            items.append(item)
            return items

Are you sure about the indentation of the return items ? It should be one less.

Scrapy only scraping first result of each page

Answers (1)

Related Questions