Scrape endpoint with Basic authentication

Question

I am trying to scrape this website. When you select an option in the first selector, the site sends a GET request to this backend endpoint and then fills in the next selector's options dynamically using JavaScript. I want to perform the same GET request with Scrapy; the problem is that the endpoint requires Basic authentication credentials.

The authentication credentials are saved the first time you visit the page, so if you then open that endpoint in your browser it works without any problem. However, if you open a private window and go directly to the endpoint without visiting the website first, a dialog appears asking you to authenticate.

I am trying to replicate this behaviour with Scrapy, but when I send the request to the endpoint I get a 401 response.

from scrapy import Spider
from scrapy.http import Request

class MIRSpider(Spider):
    name = 'MIRScrapper'
    allowed_domains = ['infoelectoral.interior.gob.es']
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None,
            # Custom middleware class; its definition is not shown here
            CustomHttpErrorMiddleware: 50
        }
    }

    start_urls = ['https://infoelectoral.interior.gob.es/opencms/es/elecciones-celebradas/area-de-descargas/']
    types_url = 'https://infoelectoral.interior.gob.es/min/convocatorias/tipos/'

    def parse(self, response):
        yield Request(
            url=self.types_url,
            method='GET',
            callback=self.parse_types,
        )

    def parse_types(self, response):
        print(response)

I don't know how to make Scrapy obtain the authorization credentials when it first visits the start URL and use them to set the headers of the second request. I checked my browser's network tab, copied the Authorization header sent by my browser, and used it like this:

def parse(self, response):
    required_header = {
        'Authorization': 'Basic YXBpSW5mb2VsZWN0b3JhbDphcGlJbmZvZWxlY3RvcmFsUHJv'
    }
    yield Request(
        url=self.types_url,
        method='GET',
        headers=required_header,
        callback=self.parse_types,
    )

I was able to get the information from the endpoint, but I don't think this is a valid solution: the key may change in the future, and I would have to update the code each time that happens.

Isn't there a middleware or something similar that handles the Basic authorization credentials? Do I have to configure it in some way?

gangabass · Accepted Answer

Your required_header solution is almost the only way to get information from this endpoint directly. The alternative is to use a real browser (Selenium, Splash, etc.) to interact with the site, but that will be much slower.

No middleware handles the Authorization header in this situation, because the header is set dynamically by JavaScript (see https://infoelectoral.interior.gob.es/opencms/export/system/modules/com.infoelectoral.mapaleaflet/resources/js/index.js for example) with calls like this:

request.setRequestHeader("Authorization", "Basic "+btoa("apiInfoelectoral:apiInfoelectoralPro"));
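
For reference, btoa is plain Base64 encoding, so the hardcoded value in the question can be rebuilt from that login and password. A minimal sketch using only the Python standard library (the helper name basic_auth_header is just for illustration, not part of Scrapy):

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    # Same operation as "Basic " + btoa(user + ":" + password) in the browser.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Credentials taken from the JavaScript snippet above.
header_value = basic_auth_header("apiInfoelectoral", "apiInfoelectoralPro")
print(header_value)
```

The resulting string can be used as the 'Authorization' entry of required_header in the question's Request.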

Of course, you can write a script that parses the login and password from the JavaScript file above, but there is no guarantee that the site owner won't change that snippet...
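
A rough sketch of such a script, assuming the file still contains a call shaped like the one quoted above. The function name and the regular expression are illustrative assumptions and would need adjusting if the site changes its code; downloading the file (for example in an earlier Scrapy callback) is left out here.

```python
import base64
import re
from typing import Optional

# Matches "Basic "+btoa("login:password") with either quote style.
# The pattern is an assumption based on the snippet quoted in this answer.
BTOA_PATTERN = re.compile(
    r"""["']Basic\s*["']\s*\+\s*btoa\(\s*["']([^"':]+):([^"']*)["']\s*\)"""
)


def extract_authorization(js_source: str) -> Optional[str]:
    """Return a Basic Authorization header value found in js_source, or None."""
    match = BTOA_PATTERN.search(js_source)
    if match is None:
        # The site changed its code; the caller has to handle this case.
        return None
    user, password = match.groups()
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# The line quoted above, used here as sample input instead of a live download.
sample = 'request.setRequestHeader("Authorization", "Basic "+btoa("apiInfoelectoral:apiInfoelectoralPro"));'
print(extract_authorization(sample))
```

In a spider, the text of the downloaded JavaScript file would be passed to this function, and a None result would signal that the hardcoded pattern no longer matches.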
