Processing images without downloading using Scrapy Spiders

Question

I'm trying to use a Scrapy Spider to solve a problem (a programming question from HackThisSite):

(1) I have to log in to a website with a username and a password (already done)

(2) After that, I have to access an image at a given URL (the image is only accessible to logged-in users)

(3) Then, without saving the image to disk, I have to read its contents into some kind of in-memory buffer

(4) Finally, the result of processing the image is used to fill in a form that is sent to the website's server (I already know how to do this step)

So the question comes down to this: is it possible, using a spider, to read an image that is accessible only to logged-in users and process it in the spider code?

I looked into different methods. Item pipelines do not seem like a good approach, because I don't want to download the file.

The code that I already have is:

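from scrapy import Spider
from scrapy.http import FormRequest, Request
from scrapy.utils.response import open_in_browser

class ProgrammingQuestion2(Spider):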

    name = 'p2'
    start_urls = ['https://www.hackthissite.org/']

    def parse(self, response):

        formdata_hts = {'username': ,
                'password': ,
                'btn_submit': 'Login'}

        return FormRequest.from_response(response,
                formdata=formdata_hts, callback=self.redirect_to_page)

    def redirect_to_page(self, response):

        yield Request(url='https://www.hackthissite.org/missions/prog/2/',
                callback=self.solve_question_2)

    def solve_question_2(self, response):

        open_in_browser(response)
        img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
        # What can I do here?

I would like to solve this with Scrapy itself; otherwise I would have to log in to the website (sending the form data) a second time.

Granitosaurus · Accepted Answer

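You can issue a Scrapy request for the image URL and handle the response in a separate callback. With Scrapy's default settings the session cookies from the login are sent along with it, so there is no need to log in again: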

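def parse_page(self, response):
    # Request the image like any other page; the response goes to parse_image
    img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
    yield Request(img_url, callback=self.parse_image)

def parse_image(self, response):
    # response.body holds the raw image bytes in memory; nothing is written to disk
    image_bytes = response.body
    # form_from_image stands in for your own image-processing function
    form_data = form_from_image(image_bytes)
    # make form request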
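
To illustrate what processing `response.body` without touching the disk can look like, here is a minimal sketch that uses only the standard library. It wraps the bytes in an `io.BytesIO` buffer and reads the width and height from the PNG header. The names `png_dimensions` and `tiny_png` are hypothetical helpers made up for this example (they are not part of Scrapy), and the sketch assumes the server really returns a PNG file.

```python
import io
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(image_bytes):
    """Return (width, height) read from the IHDR chunk of in-memory PNG bytes."""
    buffer = io.BytesIO(image_bytes)  # file-like view over the bytes, no disk access
    if buffer.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG image")
    # Each chunk starts with a 4-byte big-endian length and a 4-byte type
    _length, chunk_type = struct.unpack(">I4s", buffer.read(8))
    if chunk_type != b"IHDR":
        raise ValueError("missing IHDR chunk")
    # IHDR data begins with the width and height as 4-byte big-endian integers
    width, height = struct.unpack(">II", buffer.read(8))
    return width, height


def _chunk(kind, data):
    """Assemble one PNG chunk: length, type, data, CRC."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png(width, height):
    """Build a minimal black 8-bit grayscale PNG in memory (demonstration only)."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0) followed by one byte per pixel
    raw = b"".join(b"\x00" + b"\x00" * width for _ in range(height))
    return (PNG_SIGNATURE + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw)) + _chunk(b"IEND", b""))


# Stand-in for response.body inside the spider callback
image_bytes = tiny_png(3, 2)
print(png_dimensions(image_bytes))
```

In the spider, `parse_image` could call `png_dimensions(response.body)` in the same way. If you need the actual pixel data, an imaging library such as Pillow can decode the same buffer with `Image.open(io.BytesIO(response.body))`, again without saving a file.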
