Reputation: 33
I'm trying to use a Scrapy Spider to solve a problem (a programming question from HackThisSite):
(1) I have to log in to a website with a username and a password (already done).
(2) After that, I have to access an image at a given URL (the image is only accessible to logged-in users).
(3) Then, without saving the image to the hard disk, I have to read its contents into some kind of buffer.
(4) Finally, the result of processing the image is used to fill in a form and send the data to the website's server (I already know how to do this step).
So I can summarize the question as: is it possible, using a spider, to read an image that is accessible only to logged-in users and process it in the spider code?
I researched different methods, but using item pipelines is not a good approach, since I don't want to download the file.
The code that I already have is:
    from scrapy import FormRequest, Request, Spider
    from scrapy.utils.response import open_in_browser


    class ProgrammingQuestion2(Spider):
        name = 'p2'
        start_urls = ['https://www.hackthissite.org/']

        def parse(self, response):
            # Fill in and submit the login form found on the start page.
            formdata_hts = {'username': '<MY_USER_NAME>',
                            'password': '<MY_PASSWORD>',
                            'btn_submit': 'Login'}
            return FormRequest.from_response(response,
                                             formdata=formdata_hts,
                                             callback=self.redirect_to_page)

        def redirect_to_page(self, response):
            # Logged in at this point; open the mission page.
            yield Request(url='https://www.hackthissite.org/missions/prog/2/',
                          callback=self.solve_question_2)

        def solve_question_2(self, response):
            open_in_browser(response)
            img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
            # What can I do here?
I would like to solve this using Scrapy's own functions; otherwise I would have to log in to the website (sending the form data) again.
Upvotes: 3
Views: 315
Reputation: 21436
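You can make a Scrapy request for the image and handle the response in a separate callback. The request goes through the same spider, so with Scrapy's default cookie handling the login cookies are sent with it, and the image never has to be written to disk: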
    def parse_page(self, response):
        img_url = 'https://www.hackthissite.org/missions/prog/2/PNG'
        yield Request(img_url, callback=self.parse_image)

    def parse_image(self, response):
        image_bytes = response.body  # raw image data, held in memory only
        form_data = form_from_image(image_bytes)  # placeholder for your own processing
        # make form request
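As a rough sketch of the "no file on disk" part: `response.body` is a plain `bytes` object, so it can be wrapped in `io.BytesIO` and read like a file. The example below uses only the standard library and reads the width and height from the PNG header. The names `png_dimensions` and `make_demo_png` are made up for this illustration, the generated demo image only stands in for the real response body, and the whole thing assumes the URL really returns a PNG.

```python
import io
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(image_bytes):
    """Return (width, height) read from a PNG held entirely in memory."""
    buffer = io.BytesIO(image_bytes)  # file-like view over the bytes, no disk I/O
    if buffer.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # The first chunk must be IHDR: 4-byte length, 4-byte type, then the data.
    _length, chunk_type = struct.unpack(">I4s", buffer.read(8))
    if chunk_type != b"IHDR":
        raise ValueError("missing IHDR chunk")
    # IHDR data starts with two big-endian 32-bit integers: width and height.
    width, height = struct.unpack(">II", buffer.read(8))
    return width, height


def _chunk(chunk_type, data):
    """Build one PNG chunk (length, type, data, CRC) for the demo image."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def make_demo_png(width, height):
    """Create a tiny all-black 8-bit grayscale PNG standing in for response.body."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline is a filter byte (0) followed by `width` pixel bytes.
    raw = b"".join(b"\x00" + b"\x00" * width for _ in range(height))
    return (PNG_SIGNATURE
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw))
            + _chunk(b"IEND", b""))


if __name__ == "__main__":
    fake_body = make_demo_png(3, 2)  # in the spider this would be response.body
    print(png_dimensions(fake_body))  # prints (3, 2)
```

Inside `parse_image` you would call something like `png_dimensions(response.body)` in the same way. If you need the actual pixels, an imaging library such as Pillow can open the same kind of `io.BytesIO` object, so the image still never touches the disk.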
Upvotes: 2