Reputation: 15
I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.
For example, www.example.com/product/123 was found on www.example.com/page/2.
When Scrapy scrapes information from /product/123, I want a "Scraped From" field that returns /page/2. For every URL that is scraped, I'd like to record the originating page on which that URL was found. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!
Upvotes: 1
Views: 112
Reputation: 1035
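The easiest way is to read the Referer header. Scrapy's RefererMiddleware, which is enabled by default, sets it on each outgoing request to the URL of the page that produced that request, so look it up on response.request.headers rather than response.headers.
referer = response.request.headers.get('Referer')  # bytes, or None for start URLs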
You can also use meta to pass information along to the callback that handles the next request.
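import scrapy

# Both methods belong inside your scrapy.Spider subclass.
def parse(self, response):
    # Assuming #url is a link: select its href, since '#url' alone returns the element's HTML.
    product_url = response.css('#url::attr(href)').get()
    yield scrapy.Request(response.urljoin(product_url), callback=self.parse_product,
                         meta={'referer': response.url})

def parse_product(self, response):
    item = ItemName()  # your Item class
    item['referer'] = response.meta['referer']  # page the link was found on
    yield item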
Upvotes: 1