Scrapy problems with multiple urls

Question

I'm scraping data from multiple URLs like this:

import scrapy

from pogba.items import PogbaItem

class DmozSpider(scrapy.Spider):
    name = "pogba"
    allowed_domains = ["fourfourtwo.com"]
    start_urls = [
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459525/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459571/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459585/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459614/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459635/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459644/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459662/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459674/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459686/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459694/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459705/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459710/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459737/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459744/player-stats/74208/OVERALL_02",
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459765/player-stats/74208/OVERALL_02"
    ]

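    def parse(self, response):
        # Select every element marked "success" inside the pitch graphic.
        for sel in response.xpath('//*[@id="pitch"]/*[contains(@class,"success")]'):
            item = PogbaItem()
            # Collect the element's x (or x1) and y (or y1) attribute values.
            item['x'] = sel.xpath('(@x|@x1)').extract()
            item['y'] = sel.xpath('(@y|@y1)').extract()
            # Yield items one by one instead of building and returning a list.
            yield item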

The problem is that with this setup I get a CSV with about 200 lines in total, even though each URL on its own yields about 50 lines. Scraping one URL at a time works fine, so why do I get different results when I set multiple URLs?

alecxe · Accepted Answer

I would try adjusting the crawling speed: slow down a little by increasing the delay between requests (the DOWNLOAD_DELAY setting) and decreasing the number of concurrent requests (the CONCURRENT_REQUESTS setting), e.g. in settings.py:

DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 4
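
If you would rather not change the project-wide settings, the same two values can probably be scoped to this one spider through Scrapy's custom_settings class attribute. The following is only a minimal sketch: the class name PogbaSpider is made up for illustration, the URL list is shortened to the first entry from the question, and the values 1 and 4 are simply the ones suggested above, not tuned for this site.

```python
import scrapy


class PogbaSpider(scrapy.Spider):  # hypothetical name, for illustration only
    name = "pogba"
    allowed_domains = ["fourfourtwo.com"]

    # Per-spider overrides; these take precedence over settings.py.
    custom_settings = {
        "DOWNLOAD_DELAY": 1,       # wait about 1 second between requests
        "CONCURRENT_REQUESTS": 4,  # at most 4 requests in flight at once
    }

    start_urls = [
        # Shortened here; use the full list from the question.
        "http://www.fourfourtwo.com/statszone/21-2012/matches/459525/player-stats/74208/OVERALL_02",
    ]

    # parse() stays the same as in the question.
```

If the row count goes up after slowing the crawl down, that would suggest some responses were being dropped or returned incomplete at the higher request rate.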
