How to add a URL suffix before performing a callback in Scrapy

Question

I have a crawler that works just fine at collecting the URLs I am interested in. However, before retrieving the content of these URLs (i.e. the ones that satisfy the third rule below), I would like to update them by appending a suffix, say '/fullspecs'. In other words, I want to retrieve and further process, through the callback function, only the updated URLs. How can I do that?

rules = (
        Rule(LinkExtractor(allow=('something1'))),
        Rule(LinkExtractor(allow=('something2'))),
        Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)

marven · Accepted Answer

You can set the process_value parameter of the LinkExtractor (it is an argument of the link extractor, not of Rule) to lambda x: x + '/fullspecs', or to a named function if you want to do something more complex.

You'd end up with:

Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                   process_value=lambda x: x + '/fullspecs'),
     callback='parse_archive')

See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
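
If the suffix needs more care than a lambda gives you (for example, links that end in a slash or carry a query string), a named function can be passed instead. The following is only a minimal sketch: the name add_fullspecs and the example.com URLs are made up for illustration, and it assumes you want the suffix added to the path while the query string and fragment stay untouched.

```python
from urllib.parse import urlsplit, urlunsplit

SUFFIX = "/fullspecs"  # the suffix from the question


def add_fullspecs(url):
    """Append SUFFIX to the path of url, keeping query and fragment intact.

    Hypothetical helper, intended to be passed as process_value.
    """
    parts = urlsplit(url)
    # Drop a trailing slash so we do not produce '//fullspecs'.
    path = parts.path.rstrip("/")
    # Leave links that already carry the suffix unchanged.
    if not path.endswith(SUFFIX):
        path += SUFFIX
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Made-up URLs, for illustration only.
    print(add_fullspecs("http://example.com/item/42"))
    print(add_fullspecs("http://example.com/item/42/?page=2"))
```

It would then be wired in the same way as the lambda, i.e. LinkExtractor(allow=('something3'), deny=('something4', 'something5'), process_value=add_fullspecs).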
