Reputation: 867
I have the following Scrapy spider to get the status of the pages from the list of URLs in the file urls.txt:
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from pegasLinks.items import StatusLinkItem

class FindErrorsSpider(CrawlSpider):
    handle_httpstatus_list = [404, 400, 401, 500]
    name = "findErrors"
    allowed_domains = ["domain-name.com"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        item = StatusLinkItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item
Here's my items.py file:
import scrapy

class StatusLinkItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()
I use the following command to get the output of items in CSV:
scrapy crawl findErrors -o File.csv
The order of items in the output file is different from the order of the corresponding URLs in the urls.txt file. How can I retain the original order, or add another field to items.py (some kind of global counter representing the id of each URL) so that I can restore the original order later?
Upvotes: 2
Views: 1333
Reputation: 25349
You cannot rely on the order of urls in start_urls.

You can do the following: override the start_requests method in your spider to add something like an index parameter to the meta dictionary of each created Request object.
from scrapy import Request

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            # meta carries the url's position in urls.txt along with the request
            yield Request(url, dont_filter=True, meta={'index': index})
Later you can access meta in your parse function by using response.meta.
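For example, here is a minimal sketch of the receiving side, assuming you also add an index = scrapy.Field() to StatusLinkItem in items.py:

    def parse(self, response):
        item = StatusLinkItem()
        item['url'] = response.url
        item['status'] = response.status
        # position of this url in urls.txt, set in start_requests above
        item['index'] = response.meta['index']
        yield item

You can then sort the rows of the exported CSV by the index column to restore the original order.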
Upvotes: 3