Bruno Fernandes

Reputation: 31

Scrapy referer not returning readable url

While scraping a website, I want to capture the referer of each request that returns a 404.

def parse_item(self, response):

    if response.status == 404:
        # Do something with this referer
        referer = response.request.headers.get('Referer', None)

It kind of works, but the returned referer always looks something like this:

\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c

This looks more like a memory address than a URL. Am I missing something here?

Thank you !

Bruno

Upvotes: 2

Views: 129

Answers (3)

Ian Thompson

Reputation: 3295

Scrapy has the function referer_str to handle this for logging purposes. You could use it for your situation as well.


MRE

# Python 3.11.7
# Scrapy 2.11.1

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.utils.request import referer_str


class ToySpider(CrawlSpider):
    name: str = "toy"
    start_urls: list[str] = ["https://quotes.toscrape.com/"]
    # Enable action when response status is 308. See below for details.
    # https://doc.scrapy.org/en/latest/topics/spider-middleware.html#module-scrapy.spidermiddlewares.httperror
    handle_httpstatus_list = [308]

    rules: list[Rule] = [
        Rule(
            link_extractor=LinkExtractor(),
            callback="parse_item",
            # Don't follow links after those on the ``start_urls``.
            # This keeps the example small.
            follow=False,
        )
    ]

    @staticmethod
    def parse_item(response: HtmlResponse) -> dict[str, str | int]:
        """Return the referer and requested URLs."""
        # Using the referer_str here!!!
        referer = referer_str(request=response.request)
        if response.status == 308:
            yield {
                "referer": referer,
                "response_url": response.url,
                "status": response.status,
            }

Example output:

{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Thomas-A-Edison', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Eleanor-Roosevelt', 'status': 308}
{'referer': 'https://quotes.toscrape.com/', 'response_url': 'https://quotes.toscrape.com/author/Steve-Martin', 'status': 308}

Example usage of referer_str in scrapy source code:

Upvotes: 0

Bruno Fernandes

Reputation: 31

Thanks Yanhui, you unblocked me.

It was simpler than I was expecting:

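def parse_item(self, response):

    if response.status == 404:
        # Header values are bytes (or None if the header is missing), so decode before use
        referer = response.request.headers.get('Referer', None)
        if referer is not None:
            referer = referer.decode('utf-8')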

Upvotes: 1

Y4nhu1

Reputation: 116

The leading \x escape sequence means that the next two characters are interpreted as hex digits for the character code. (See "What does a leading \x mean in a Python string \xaa".)

\x68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c

In this case there is only one \x, but what follows is still a hex string. You can decode it to get the URL. XD

>>> # The \x needs to be removed from the string first
>>> hex_str = '68747470733a2f2f7777772e6162752d64686162692e6d657263656465732d62656e7a2d6d656e612e636f6d2f61722f70617373656e676572636172732f6d657263656465732d62656e7a2d636172732f6d6f64656c732f676c652f636f7570652d633136372f6578706c6f72652e68746d6c'
>>> bytes.fromhex(hex_str)
b'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
>>> bytes.fromhex(hex_str).decode('utf-8')
'https://www.abu-dhabi.mercedes-benz-mena.com/ar/passengercars/mercedes-benz-cars/models/gle/coupe-c167/explore.html'
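If you need to do this in more than one place, the two steps above (strip the \x, then decode the hex) could be wrapped in a small helper. This is only a sketch: the name hex_to_url and its prefix parameter are made up for illustration and are not part of Scrapy or the standard library.

```python
def hex_to_url(hex_text: str, prefix: str = "\\x") -> str:
    """Decode a hex-encoded string to text (hypothetical helper).

    A single leading prefix (a literal backslash followed by "x") is
    dropped if present, because bytes.fromhex() accepts hex digits only.
    """
    if hex_text.startswith(prefix):
        hex_text = hex_text[len(prefix):]
    # fromhex() raises ValueError if anything other than hex digits remains
    return bytes.fromhex(hex_text).decode("utf-8")


# The first eight bytes of the string from the question
print(hex_to_url("\\x68747470733a2f2f"))  # prints https://
```

The helper leaves the ValueError from bytes.fromhex() uncaught, so input that is not actually hex fails loudly instead of producing a wrong URL.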

Upvotes: 1
