Reputation: 65
I'm using https://github.com/rolando/scrapy-redis to create a spider that reads URLs from a Redis list. The problem I have is that I want to send a unique ID alongside each URL, so that I can identify the entry in the db again.
My list in Redis looks like this:
http://google.com[someuniqueid]
http://example.com[anotheruniqueid]
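For reference, here's a minimal sketch of how such entries could be seeded into the list with the standard redis-py client, assuming a spider named "myspider" and the default redis_key of "myspider:start_urls" (both names are illustrative):

import redis

r = redis.StrictRedis(host='localhost', port=6379)

# Each entry packs the URL and a unique ID into one string,
# matching the format described above.
r.lpush('myspider:start_urls', 'http://google.com[someuniqueid]')
r.lpush('myspider:start_urls', 'http://example.com[anotheruniqueid]')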
By default, scrapy-redis reads only a URL from Redis, which is then passed to the spider.
I modified this file: https://github.com/rolando/scrapy-redis/blob/master/scrapy_redis/spiders.py
And changed this function:
def next_request(self):
    """Returns a request to be scheduled or None."""
    url = self.server.lpop(self.redis_key)
    if url:
        # Split "http://google.com[someuniqueid]" into URL and ID parts.
        mm = url.split("[")
        # Store the ID on the spider instance so it can be read later.
        self.guid = mm[1].replace("]", "")
        return self.make_requests_from_url(mm[0])
This works: I can get the guid inside my spider by calling:
print self.guid
The problem, however, is that it seems to mix up the guids; I don't always have the correct guid for each URL.
How should I send the guid to my spider?
Upvotes: 0
Views: 513
Reputation: 21446
This happens because Scrapy is asynchronous: several requests can be in flight at once, so by the time a response is parsed, self.guid may already have been overwritten by a later pop. You can't rely on storing per-request data in an instance variable. There are a few ways to approach this; the most common is to use scrapy.Request with a meta={'guid': guid} argument.
replace this line:
return self.make_requests_from_url(mm[0])
with:
return scrapy.Request(mm[0], meta={'guid': mm[1].replace("]", "")})
and now in your parse() you can access the guid with:
def parse(self, response):
    guid = response.meta['guid']
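Putting it together, the modified next_request() could look like the sketch below (written against the old scrapy-redis API the question uses; make_requests_from_url is no longer needed since the Request is built directly, and the callback defaults to parse()):

import scrapy

def next_request(self):
    """Returns a request to be scheduled or None."""
    data = self.server.lpop(self.redis_key)
    if data:
        # Split "http://google.com[someuniqueid]" into URL and guid.
        url, _, guid = data.partition("[")
        # Attach the guid to this specific request instead of the spider,
        # so concurrent requests can't overwrite each other's value.
        return scrapy.Request(url, meta={'guid': guid.rstrip("]")})

Each Request now carries its own guid in meta, so the value read in parse() always matches the URL that produced the response.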
Upvotes: 2