Reputation: 5380
I have a crawler that correctly collects the URLs I am interested in. However, before retrieving the content of those URLs (i.e. the ones matched by rule no. 3), I would like to modify them by appending a suffix, say '/fullspecs'. In other words, I want to request and process, through the callback function, only the modified URLs. How can I do that?
rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)
Upvotes: 2
Views: 82
Reputation: 1846
You can set the process_value parameter of the LinkExtractor to lambda x: x+'/fullspecs', or to a named function if you want to do something more complex. Note that process_value is an argument of LinkExtractor, not of Rule.
You'd end up with:
Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                   process_value=lambda x: x+'/fullspecs'),
     callback='parse_archive')
See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
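If plain concatenation is too blunt (for example, when some links end in a slash or carry a query string), a named function may be safer than the lambda. The following is only a sketch: the name add_suffix and the example.com URL are made up for illustration, and it assumes the suffix should go on the path while the query string and fragment stay untouched.

```python
from urllib.parse import urlsplit, urlunsplit

SUFFIX = "/fullspecs"  # the suffix from the question


def add_suffix(url, suffix=SUFFIX):
    """Append suffix to the path of url, keeping query and fragment intact.

    Hypothetical helper, not part of Scrapy.
    """
    parts = urlsplit(url)
    # Drop a trailing slash so we do not produce '//fullspecs'.
    path = parts.path.rstrip("/")
    # Leave URLs that already end with the suffix unchanged.
    if path.endswith(suffix):
        return url
    return urlunsplit(parts._replace(path=path + suffix))


print(add_suffix("http://example.com/something3/item?page=2"))
# -> http://example.com/something3/item/fullspecs?page=2
```

It could then be passed as process_value=add_suffix to the LinkExtractor in place of the lambda.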
Upvotes: 1