lifeotheparty

Reputation: 15

Is there a way to get the URL that a link is scraped from?

I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.

For example:

www.example.com/product/123 was found on www.example.com/page/2.

When Scrapy scrapes information from /product/123, I want to have a "Scraped From" field that returns /page/2. For every URL that is scraped, I'd want to record the originating page that the URL was found on. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!

Upvotes: 1

Views: 112

Answers (1)

ThePyGuy

Reputation: 1035

The easiest way is to check the request headers: Scrapy's referer middleware (enabled by default) sets a Referer header on requests it follows, and the request that produced a response is available as response.request. Note that Scrapy stores header values as bytes, so decode before storing:

referer = response.request.headers.get('Referer', b'').decode('utf-8')
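Since header values come back as bytes, a small helper can normalize them to text. This is a minimal sketch in plain Python (no Scrapy required); the function name and the plain dict standing in for Scrapy's Headers object are illustrative assumptions:

```python
def referer_from_headers(headers):
    """Return the Referer value as text, or None if absent.

    Works with mappings whose keys/values may be bytes (as in
    Scrapy's Headers) or plain str.
    """
    value = headers.get(b'Referer') or headers.get('Referer')
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return value


# A plain dict stands in for the real Headers object here:
print(referer_from_headers({b'Referer': b'http://www.example.com/page/2'}))
# → http://www.example.com/page/2
```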

You can also use the request's meta dict to pass information along to the next callback:

def parse(self, response):
    # Grab the link's href (not the whole element) and follow it,
    # carrying the current page's URL along in meta.
    product_url = response.css('#url::attr(href)').get()
    yield scrapy.Request(
        response.urljoin(product_url),  # resolve relative links
        callback=self.parse_product,
        meta={'referer': response.url},
    )

def parse_product(self, response):
    # The originating page's URL comes back out of response.meta.
    referer = response.meta['referer']
    item = ItemName()
    item['referer'] = referer
    yield item

Upvotes: 1
