Reputation: 867
I have the following Scrapy spider to get the status of the pages from the list of URLs in the file urls.txt:
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from pegasLinks.items import StatusLinkItem

class FindErrorsSpider(CrawlSpider):
    handle_httpstatus_list = [404, 400, 401, 500]
    name = "findErrors"
    allowed_domains = ["domain-name.com"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        item = StatusLinkItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item
Here's my items.py file:
import scrapy

class StatusLinkItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()
I use the following command to get the output of items in CSV:
scrapy crawl findErrors -o File.csv
The order of items in the output file is different from the order of the corresponding URLs in the urls.txt file. How can I retain the original order, or add another field to items.py (some kind of global counter representing the id of each URL) so that I can restore the original order later?
Upvotes: 2
Views: 1333
Reputation: 25349
You cannot rely on the order of urls in start_urls.

You can do the following: override the start_requests method in your spider to add something like an index parameter to the meta dictionary of each created Request object.
from scrapy import Request

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            # meta carries the url's position in urls.txt along with the request
            yield Request(url, dont_filter=True, meta={'index': index})
Later you can access meta in your parse function by using response.meta.
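For example, here is a minimal sketch of the receiving side, assuming you also add an index = scrapy.Field() to StatusLinkItem in items.py:

    def parse(self, response):
        item = StatusLinkItem()
        item['url'] = response.url
        item['status'] = response.status
        # position of this url in urls.txt, set in start_requests above
        item['index'] = response.meta['index']
        yield item

You can then sort the rows of the exported CSV by the index column to restore the original order.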
Upvotes: 3