lifeotheparty

Reputation: 15

Is there a way to get the URL that a link is scraped from?

I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.

For example:

www.example.com/product/123 was found on www.example.com/page/2.

When Scrapy scrapes information from /product/123, I want to have a "Scraped From" field that returns /page/2. For every URL that is scraped, I'd want to record the originating page that the URL was found on. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!

Upvotes: 1

Views: 112

Answers (1)

ThePyGuy

Reputation: 1035

The easiest way is to check the request headers: Scrapy's referer middleware (enabled by default) sets a Referer header on requests it follows, and the request that produced a response is available as response.request. Note that Scrapy stores header values as bytes, so decode before storing:

referer = response.request.headers.get('Referer', b'').decode('utf-8')
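Since header values come back as bytes, a small helper can normalize them to text. This is a minimal sketch in plain Python (no Scrapy required); the function name and the plain dict standing in for Scrapy's Headers object are illustrative assumptions:

```python
def referer_from_headers(headers):
    """Return the Referer value as text, or None if absent.

    Works with mappings whose keys/values may be bytes (as in
    Scrapy's Headers) or plain str.
    """
    value = headers.get(b'Referer') or headers.get('Referer')
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return value


# A plain dict stands in for the real Headers object here:
print(referer_from_headers({b'Referer': b'http://www.example.com/page/2'}))
# → http://www.example.com/page/2
```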

You can also use the request's meta dict to pass information along to the next callback:

def parse(self, response):
    # Grab the link's href (not the whole element) and follow it,
    # carrying the current page's URL along in meta.
    product_url = response.css('#url::attr(href)').get()
    yield scrapy.Request(
        response.urljoin(product_url),  # resolve relative links
        callback=self.parse_product,
        meta={'referer': response.url},
    )

def parse_product(self, response):
    # The originating page's URL comes back out of response.meta.
    referer = response.meta['referer']
    item = ItemName()
    item['referer'] = referer
    yield item

Upvotes: 1
