Ayohaych

Reputation: 5189

Scrapy code not iterating through pages

I'm completely new to Scrapy and Python, so I'm not sure what the problem is here. I'm trying to iterate through each page of results and store the titles listed on each page.

This is the code that isn't working. It gets the first page fine, but just prints empty titles for the rest.

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from metacritic.items import MetacriticItem

class MetacriticSpider(BaseSpider):
    name = "metacritic"
    allowed_domains = ["metacritic.com"]
    max_id = 5
    start_urls = [
        "http://www.metacritic.com/browse/games/title/ps3?page="
        #"http://www.metacritic.com/browse/games/title/xbox360?page=0"
    ]

    def start_requests(self):
        for i in range(self.max_id):
            yield Request('http://www.metacritic.com/browse/games/title/ps3?page=%d' % i, callback = self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ol/li/div/div')
        items = []

        for site in sites:
            item = MetacriticItem()
            item['title'] = site.xpath('a/text()').extract()

            items.append(item)
        return items

Upvotes: 0

Views: 1534

Answers (1)

seikichi

Reputation: 1281

I think the URL 'http://www.metacritic.com/browse/games/title/ps3?page=%d' % i is wrong. If you open http://www.metacritic.com/browse/games/title/ps3?page=1, you will see the message "No Results Found".

The correct URL seems to be 'http://www.metacritic.com/browse/games/title/ps3/%c?page=%d' % (c, i), where c is a lowercase letter (ex1, ex2). I modified your code as follows. Does this work for you?

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from metacritic.items import MetacriticItem
from string import ascii_lowercase  # available on Python 2 and 3; string.lowercase is Python 2 only


class MetacriticSpider(BaseSpider):
    name = "metacritic"
    allowed_domains = ["metacritic.com"]
    max_id = 5

    def start_requests(self):
        # Issue one request per (letter, page) pair.
        for c in ascii_lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps3/{0}?page={1}'.format(c, i),
                              callback=self.parse)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="product_wrap"]/div')
        items = []

        for site in sites:
            titles = site.xpath('a/text()').extract()
            if titles:
                item = MetacriticItem()
                item['title'] = titles[0].strip()
                items.append(item)
        return items
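
To check the URL scheme and the empty-title guard without running a crawl, the two pieces of logic above can be pulled out into plain functions. This is only a sketch: the names `URL_TEMPLATE`, `build_urls`, and `first_title` are hypothetical helpers introduced here, and the URL pattern is the one I guessed above, not something confirmed by the site.

```python
from itertools import product
from string import ascii_lowercase

# Assumed URL pattern (the one guessed in this answer; not verified against the live site).
URL_TEMPLATE = "http://www.metacritic.com/browse/games/title/ps3/{0}?page={1}"


def build_urls(letters=ascii_lowercase, max_pages=5):
    """Return one URL per (letter, page) pair, in letter-major order."""
    return [URL_TEMPLATE.format(c, i)
            for c, i in product(letters, range(max_pages))]


def first_title(extracted):
    """Return the first extracted string, stripped, or None for an empty list.

    Selector.xpath(...).extract() returns a list, which is empty when the
    node has no matching <a> text, so indexing it blindly would fail.
    """
    return extracted[0].strip() if extracted else None
```

In the spider, start_requests could then yield Request(url, callback=self.parse) for each url in build_urls(max_pages=self.max_id), and parse could skip any site for which first_title(...) returns None.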

Upvotes: 3
