Reputation: 1113
I'm using Scrapy to crawl some pages. I read the start_urls from an Excel sheet, and I need to save each URL in the item.
class abc_Spider(BaseSpider):
    name = 'abc'
    allowed_domains = ['abc.com']
    wb = xlrd.open_workbook(path + '/somefile.xlsx')
    wb.sheet_names()
    sh = wb.sheet_by_name(u'Sheet1')
    first_column = sh.col_values(15)
    start_urls = first_column
    handle_httpstatus_list = [404]

    def parse(self, response):
        item = abcspiderItem()
        item['url'] = response.url
The problem is that the URL gets redirected to some other URL, so response.url holds the redirect target instead. How do I get the original URL that I read from the Excel sheet?
Upvotes: 16
Views: 7994
Reputation: 1076
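If anyone is still looking for an answer: for Scrapy 2.6+, use
(response.request.headers.get('Referer') or b'').decode('utf-8')
This returns the Referer header of the request, i.e. the URL the request originated from. Header values are byte strings, hence the decode. The `or b''` guard avoids an AttributeError when the header is absent, which is typically the case for requests created directly from start_urls, since they have no referring page.
For more, see the Scrapy request and response documentation.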
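For the redirect case in the question specifically, another option is the redirect_urls key that Scrapy's RedirectMiddleware stores in the request meta. The sketch below assumes that middleware is enabled (it is by default). original_url is a hypothetical helper name, and the SimpleNamespace objects and URLs are made-up stand-ins for a real response so the snippet runs without a crawl.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the URL first requested, before any redirects (hypothetical helper)."""
    # RedirectMiddleware records the URLs a request was redirected from in
    # meta['redirect_urls'], oldest first; the key is absent if no redirect happened.
    redirect_urls = response.request.meta.get('redirect_urls')
    if redirect_urls:
        return redirect_urls[0]
    return response.url


# Stand-in for a response that was redirected once.
redirected = SimpleNamespace(
    url='http://abc.com/new',
    request=SimpleNamespace(meta={'redirect_urls': ['http://abc.com/old']}),
)
# Stand-in for a response that was not redirected.
direct = SimpleNamespace(url='http://abc.com/page', request=SimpleNamespace(meta={}))

print(original_url(redirected))
print(original_url(direct))
```

In the spider's parse method this would be used as item['url'] = original_url(response).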
Upvotes: 0
Reputation: 33
This gave me the original 'referer URL', i.e. which of my start_urls led to the URL corresponding to this request object being scraped:
req = response.request
req_headers = req.headers
referer_url = req_headers['Referer'].decode('utf-8')
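The lookup above fails if the request carries no Referer header. A minimal sketch of a more defensive version follows. referer_url_of is a hypothetical helper name, and the stand-in objects use a plain dict where Scrapy would supply its own headers object, which likewise offers .get() and holds byte-string values.

```python
from types import SimpleNamespace


def referer_url_of(response):
    """Return the request's Referer as str, or None if absent (hypothetical helper)."""
    raw = response.request.headers.get('Referer')
    if raw is None:
        # No referring page recorded for this request.
        return None
    # Header values are bytes, so decode them to text.
    return raw.decode('utf-8')


# Stand-in for a response whose request was yielded from another page (made-up URL).
followed = SimpleNamespace(
    request=SimpleNamespace(headers={'Referer': b'http://abc.com/start'})
)
# Stand-in for a response whose request has no Referer header.
initial = SimpleNamespace(request=SimpleNamespace(headers={}))

print(referer_url_of(followed))
print(referer_url_of(initial))
```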
Upvotes: 2