user706838

Reputation: 5380

How to add a url suffix before performing a callback in scrapy

I have a crawler that works fine for collecting the URLs I am interested in. However, before retrieving the content of the URLs matched by the third rule, I would like to modify them by appending a suffix, say '/fullspecs'. In other words, only the modified URLs should be requested and passed to the callback. How can I do that?

rules = (
        Rule(LinkExtractor(allow=('something1'))),
        Rule(LinkExtractor(allow=('something2'))),
        Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)

Upvotes: 2

Views: 82

Answers (1)

marven

Reputation: 1846

process_value is a parameter of LinkExtractor, not of Rule, so pass it to the extractor. Set it to lambda x: x + '/fullspecs', or to a named function if you want to do something more complex.

You'd end up with:

Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                   process_value=lambda x: x + '/fullspecs'),
     callback='parse_archive')

See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
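
If plain concatenation is too blunt (for example when some links end in a slash or carry a query string), a named function can be used instead of the lambda. The following is only a sketch: the function name add_fullspecs_suffix and the example.com URLs are made up for illustration, and it assumes the suffix belongs at the end of the URL path.

```python
from urllib.parse import urlsplit, urlunsplit

SUFFIX = "/fullspecs"  # the suffix from the question


def add_fullspecs_suffix(url):
    """Append SUFFIX to the path of url (hypothetical helper).

    Trailing slashes are dropped first, the query string and fragment
    are preserved, and a URL that already ends in SUFFIX is not
    suffixed a second time.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(SUFFIX):
        path += SUFFIX
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


# Intended use, with the function passed to the extractor:
#   LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
#                 process_value=add_fullspecs_suffix)
```

process_value may also return None, in which case the link is ignored, so the same function could filter links as well as rewrite them.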

Upvotes: 1
