Reputation: 395
In Scrapy, we can get response.url and response.request.url, but how do we know which parent URL those URLs were extracted from?
Thank you, Ken
Upvotes: 1
Views: 1071
Reputation: 3847
You can use Request.meta to keep track of such information.
When you yield your request, include response.url in the meta:
yield response.follow(link, …, meta={'source_url': response.url})
Then read it in your parsing method:
source_url = response.meta['source_url']
That is the most straightforward way to do this, and you can use this method to keep track of original URLs even across different parsing methods, if you wish.
Otherwise, you might want to look into taking advantage of the redirect_urls meta key, which keeps track of redirect jumps.
Upvotes: 4