Reputation: 1113

How to get the original start_url in Scrapy (before redirect)

I'm using Scrapy to crawl some pages. I read the start_urls from an Excel sheet, and I need to save each URL in the item.

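import xlrd

# BaseSpider, abcspiderItem and path are defined elsewhere in the project.
class abc_Spider(BaseSpider):
    name = 'abc'
    allowed_domains = ['abc.com']
    # Read the start URLs from one column of the spreadsheet.
    wb = xlrd.open_workbook(path + '/somefile.xlsx')
    sh = wb.sheet_by_name(u'Sheet1')
    first_column = sh.col_values(15)
    start_urls = first_column
    handle_httpstatus_list = [404]

    def parse(self, response):
        item = abcspiderItem()
        item['url'] = response.url  # this is the URL after any redirect
        return item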

The problem is that the URL gets redirected to another URL, so response.url holds the redirect target rather than the address I started with. How do I get the original URL that I read from the Excel sheet?

Upvotes: 16

Answers (3)

ahmedshahriar

Reputation: 1076

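If anyone is still looking for an answer:

For Scrapy 2.6+, use response.request.headers.get('Referer', None).decode("utf-8")

This gives you the original URL. The header value is a byte string, hence the decode. Note that .get() returns None when the request has no Referer header, in which case the .decode call raises AttributeError, so check for None first.

For more, see: Scrapy request response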

Upvotes: 0

Yusuf Khaled

Reputation: 33

This gave me the original 'referer URL', i.e. which of my start_urls led to the URL corresponding to this request object being scraped:

req = response.request
req_headers = req.headers  # same object as req.__dict__['headers']
referer_url = req_headers['Referer'].decode('utf-8')
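
A slightly more defensive version of the lookup above, as a minimal sketch. It assumes the Referer header may be absent (for instance, on a request that was not generated from a previous response). The helper name referer_url and the SimpleNamespace stand-ins are hypothetical and only illustrate the pattern; in a spider you would pass response.request.

```python
from types import SimpleNamespace


def referer_url(request, default=None):
    """Return the request's Referer header as str, or `default` if it is missing."""
    raw = request.headers.get('Referer')  # bytes in Scrapy, None when absent
    if raw is None:
        return default
    return raw.decode('utf-8')


# Stand-ins for Scrapy requests (hypothetical data, not real crawl output).
linked = SimpleNamespace(headers={'Referer': b'http://abc.com/start'})
direct = SimpleNamespace(headers={})

print(referer_url(linked))
print(referer_url(direct))
```

Inside parse this would be called as referer_url(response.request).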

Upvotes: 2

alecxe

Reputation: 474171

You can find what you need in response.request.meta['redirect_urls'].

Quote from docs:

The urls which the request goes through (while being redirected) can be found in the redirect_urls Request.meta key.
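
A minimal sketch of how this could be used in parse. It assumes that the first entry of redirect_urls is the URL that was originally requested, and that the key is missing when no redirect happened, in which case it falls back to response.url. The helper name original_url and the SimpleNamespace stand-ins are hypothetical.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the URL requested before any redirects, else response.url."""
    redirect_urls = response.request.meta.get('redirect_urls')
    # The key is only present when at least one redirect was followed.
    return redirect_urls[0] if redirect_urls else response.url


# Stand-ins for Scrapy responses (hypothetical data, not real crawl output).
redirected = SimpleNamespace(
    url='http://abc.com/new',
    request=SimpleNamespace(meta={'redirect_urls': ['http://abc.com/old']}),
)
plain = SimpleNamespace(
    url='http://abc.com/page',
    request=SimpleNamespace(meta={}),
)

print(original_url(redirected))
print(original_url(plain))
```

In the spider from the question, the assignment would then read item['url'] = original_url(response).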

Hope that helps.

Upvotes: 27
