mjhd

Reputation: 65

Scrapy send variable along with URL to spider

I'm using https://github.com/rolando/scrapy-redis to create a spider that reads URLs from a Redis list. The problem I have is that I want to send a unique ID alongside each URL, so that I can identify the entry in the db again.

My list in redis looks like this:

http://google.com[someuniqueid]
http://example.com[anotheruniqueid]

By default, scrapy-redis reads only a URL from Redis, which is then sent to the spider.

I modified this file: https://github.com/rolando/scrapy-redis/blob/master/scrapy_redis/spiders.py

And changed this function:

def next_request(self):
    """Returns a request to be scheduled or none."""
    url = self.server.lpop(self.redis_key)
    if url:
        # Split "http://google.com[someuniqueid]" into URL and guid
        mm = url.split("[")
        self.guid = mm[1].replace("]", "")
        return self.make_requests_from_url(mm[0])
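For reference, the bracket-splitting can be sketched as a standalone helper (this is an illustrative function, not part of scrapy-redis; `rpartition` is used so that a stray `[` earlier in the URL would not break the split):

```python
def parse_entry(entry):
    """Split a redis entry like 'http://example.com[anotheruniqueid]'
    into a (url, guid) pair."""
    # Strip the trailing ']' and split on the LAST '[' in the string.
    url, _, guid = entry.rstrip("]").rpartition("[")
    return url, guid
```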

This works, I can get the guid inside my spider by calling:

print self.guid

The problem, however, is that it seems to mix up the guids. I don't always have the correct guid for each URL.

How should I send the guid to my spider?

Upvotes: 0

Views: 513

Answers (1)

Granitosaurus

Reputation: 21446

This happens because Scrapy is asynchronous and you are storing per-request data in an instance variable, so you can't rely on it: by the time a callback runs, the variable may have been overwritten by a later request. There are a few ways to approach this. The most common is to use scrapy.Request with a meta={'guid': guid} argument.

Replace this line:

return self.make_requests_from_url(mm[0])

with (note that this requires import scrapy at the top of the file):

return scrapy.Request(mm[0], meta={'guid': mm[1].replace("]", "")})

and now in your parse() you can access the guid with:

def parse(self, response):
    guid = response.meta['guid']
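To see why the instance variable gets mixed up, here is a minimal simulation (no Scrapy involved, just a hypothetical Spider class standing in for the real one): two requests are scheduled back to back, but their callbacks run later, after self.guid has already been overwritten by the second request.

```python
class Spider:
    def __init__(self):
        self.guid = None
        self.scheduled = []

    def next_request(self, url, guid):
        # BAD: the guid lives on the spider object, shared by ALL requests.
        self.guid = guid
        self.scheduled.append(url)

    def parse(self, url):
        # By the time this callback fires, self.guid may belong
        # to a different request.
        return url, self.guid

spider = Spider()
spider.next_request("http://google.com", "guid-1")
spider.next_request("http://example.com", "guid-2")  # overwrites self.guid

# Callbacks run later; both now see the last value written:
results = [spider.parse(url) for url in spider.scheduled]
# results: [("http://google.com", "guid-2"), ("http://example.com", "guid-2")]
# -> the first URL is wrongly paired with guid-2
```

Attaching the guid to each request via meta sidesteps this entirely, because the value travels with the request instead of living on the shared spider object.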

Upvotes: 2
