Reputation: 31
While scraping a website, I want to capture the referer of any request that returns a 404.
def parse_item(self, response):
    if response.status == 404:
        # Do something with this referer
        referer = response.request.headers.get('Referer', None)
It kind of works, but the returned referer always looks something like this:
\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c
This looks more like a memory address than a URL. Am I missing something here?
Thank you!
Bruno
Upvotes: 2
Views: 129
Reputation: 3295
Scrapy provides the function referer_str to handle this for logging purposes. You can use it in your situation as well.
# Python 3.11.7
# Scrapy 2.11.1
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.utils.request import referer_str


class ToySpider(CrawlSpider):
    name: str = "toy"
    start_urls: list[str] = ["https://quotes.toscrape.com/"]

    # Enable action when response status is 308. See below for details.
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html#module-scrapy.spidermiddlewares.httperror
    handle_httpstatus_list = [308]

    rules: list[Rule] = [
        Rule(
            link_extractor=LinkExtractor(),
            callback="parse_item",
            # Don't follow links after those on the ``start_urls``.
            # This keeps the example small.
            follow=False,
        )
    ]

    @staticmethod
    def parse_item(response: HtmlResponse) -> dict[str, str | int]:
        """Return the referer and requested URLs."""
        # Using the referer_str here!!!
        referer = referer_str(request=response.request)
        if response.status == 308:
            yield {
                "referer": referer,
                "response_url": response.url,
                "status": response.status,
            }
Example output:
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Thomas-A-Edison', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Eleanor-Roosevelt', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Steve-Martin', 'status': 308}
referer_str is defined in the Scrapy source code, in scrapy.utils.request.
Upvotes: 0
Reputation: 31
Thanks, Yanhui, you unblocked me.
It was simpler than I was expecting:
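def parse_item(self, response):
    if response.status == 404:
        # Header values are bytes; decode them to get a readable URL.
        # The b'' default avoids calling .decode() on None when the header is missing.
        referer = response.request.headers.get('Referer', b'').decode('utf-8')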
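Since header values arrive as bytes and the Referer header can be absent (typically on requests generated from start_urls), it may be worth guarding the decode step. Here is a minimal sketch, assuming a plain dict as a stand-in for response.request.headers; the helper name decode_referer is hypothetical and not part of Scrapy:

```python
from typing import Mapping, Optional


def decode_referer(headers: Mapping[str, bytes]) -> Optional[str]:
    """Return the Referer header as text, or None if it is absent.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    raw = headers.get("Referer")
    if raw is None:
        return None
    # errors="replace" avoids a UnicodeDecodeError on malformed header bytes.
    return raw.decode("utf-8", errors="replace")


# Plain dicts stand in for response.request.headers here (an assumption).
with_referer = {"Referer": b"https://quotes.toscrape.com/"}
without_referer = {}

print(decode_referer(with_referer))
print(decode_referer(without_referer))
```

Returning None rather than an empty string keeps "no referer" distinguishable from an empty header value when the items are stored later.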
Upvotes: 1
Reputation: 116
The leading \x escape sequence means the next two characters are interpreted as hex digits for the character code (see "What does a leading \x mean in a Python string \xaa").
\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c
In this case there is only one \x, but what follows it is still a hex string.
You can decode it and get the URL. XD
>>> # The leading \x needs to be removed from the string first
>>> hex_str = '68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c'
>>> bytes.fromhex(hex_str)
b'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
>>> bytes.fromhex(hex_str).decode('utf-8')
'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
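If values like this have already been stored with the \x prefix, the same steps can be wrapped in a small function. This is only a sketch: the name hex_to_url and the short sample value are made up for illustration, and str.removeprefix needs Python 3.9 or later:

```python
def hex_to_url(value: str) -> str:
    """Decode a hex string, with or without a leading backslash-x, to text."""
    # Drop the literal two-character prefix (backslash + "x") if present.
    hex_digits = value.removeprefix("\\x")
    return bytes.fromhex(hex_digits).decode("utf-8")


# Made-up sample: the hex encoding of "https://example.com/" with the prefix.
sample = r"\x68747470733a2f2f6578616d706c652e636f6d2f"
print(hex_to_url(sample))

# Round trip: encoding the URL back to hex gives the digits after the prefix.
assert "https://example.com/".encode("utf-8").hex() == sample[2:]
```

The round-trip check at the end shows where such a value comes from: it is simply the URL's bytes written out as hex digits.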
Upvotes: 1