Umair Ayub

Reputation: 21201

Wait for a Request to complete - Python Scrapy

I have a Scrapy spider that scrapes a website, and that website requires a token to be refreshed from time to time before its pages can be accessed.

def get_ad(self, response):
    temp_dict = AppextItem()
    try:
        Selector(response).xpath('//div[@class="messagebox"]').extract()[0]
        print("Captcha found when scraping ID "+ response.meta['id'] + " LINK: "+response.meta['link'])
        self.p_token = ''

        return Request(url = url_, callback=self.get_p_token, method = "GET",priority=1, meta = response.meta)

    except Exception:
        print("Captcha was not found")

I have a get_p_token method that refreshes the token and assigns it to self.p_token.

get_p_token is called when a CAPTCHA is found, but the problem is that other Requests keep executing in the meantime.

What I want is this: if a CAPTCHA is found, do not make the next request until get_p_token has finished executing.

I have set priority=1, but that does not help.

HERE is full code of Spider

P.S:

The token is passed as part of every URL, which is why I want to wait until a new token has been obtained before scraping the rest of the URLs.

Upvotes: 6

Views: 3464

Answers (2)

Gallaecio

Reputation: 3847

You should implement your CAPTCHA solving logic as a middleware. See captcha-middleware for inspiration.

The middleware should take care of assigning the right token to requests (from process_request()) and detect CAPTCHA prompts (from process_response()).

Within the middleware, you can use something other than Scrapy (e.g. requests) to perform the requests needed for CAPTCHA solving in a synchronous way that prevents new requests from starting until done.

Of course, any parallel requests that are already in flight will have been sent with the old token, so it is technically possible for a few requests to go out without an updated token. However, those should be retried automatically. You can have your middleware update the tokens of those requests upon retrying by making sure it works nicely with the retry middleware.
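As a rough illustration of the middleware approach described above, here is a minimal sketch of a downloader middleware. It is not taken from captcha-middleware, and several names are assumptions: the query parameter `p_token`, the `fetch_token()` hook (to be overridden with a blocking call, e.g. via requests), and the byte-string check that stands in for the XPath test on the "messagebox" div. It only relies on `request.url`, `request.replace()` and `response.body`, which Scrapy's Request and Response objects provide.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def get_token(url, param="p_token"):
    """Return the value of the token query parameter, or None if absent."""
    return dict(parse_qsl(urlsplit(url).query)).get(param)


def with_token(url, token, param="p_token"):
    """Return url with the token query parameter set to the given token."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    query.append((param, token))
    return urlunsplit(parts._replace(query=urlencode(query)))


class TokenMiddleware:
    """Sketch: stamp every request with the current token and refresh the
    token when a CAPTCHA page comes back."""

    # Simplified stand-in for the XPath check used in the question.
    captcha_marker = b'class="messagebox"'

    def __init__(self):
        self.token = None

    def fetch_token(self):
        # Hypothetical hook: override with a blocking call that returns a
        # fresh token. Blocking here is what holds back other downloads.
        raise NotImplementedError

    def process_request(self, request, spider):
        if self.token is None:
            self.token = self.fetch_token()
        if get_token(request.url) != self.token:
            # Returning a Request makes Scrapy reschedule it instead of
            # downloading the stale one.
            return request.replace(url=with_token(request.url, self.token),
                                   dont_filter=True)
        return None  # token is current, continue with the download

    def process_response(self, request, response, spider):
        if self.captcha_marker not in response.body:
            return response
        # Refresh only if this request used the current token; otherwise
        # another response has already triggered the refresh.
        if get_token(request.url) == self.token:
            self.token = self.fetch_token()
        return request.replace(url=with_token(request.url, self.token),
                               dont_filter=True)
```

To try it, subclass `TokenMiddleware`, implement `fetch_token()`, and list the subclass in the `DOWNLOADER_MIDDLEWARES` setting. This sketch re-issues the request itself rather than going through the retry middleware, and it does not cap the number of attempts, so a real version would likely want a retry limit.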

Upvotes: 2

Rafael Almeida

Reputation: 5240

This is how I would go about it:

def get_p_token(self, response):
    # generate token
    ...
    # dont_filter=True keeps the duplicate filter from dropping a URL
    # that has already been requested once.
    yield Request(url=response.url, callback=self.no_captcha, method="GET",
                  priority=1, meta=response.meta, dont_filter=True)


def get_ad(self, response):
    temp_dict = AppextItem()
    try:
        # Indexing [0] raises IndexError when no CAPTCHA box is present.
        Selector(response).xpath('//div[@class="messagebox"]').extract()[0]
        print("Captcha found when scraping ID " + response.meta['id'] + " LINK: " + response.meta['link'])
        self.p_token = ''

        # url_ is not defined in this snippet.
        yield Request(url=url_, callback=self.get_p_token, method="GET",
                      priority=1, meta=response.meta)

    except Exception:
        print("Captcha was not found")
        yield Request(url=url_, callback=self.no_captcha, method="GET",
                      priority=1, meta=response.meta)

You haven't provided working code, so this is only a demonstration. The logic here is pretty simple:

If a CAPTCHA is found, the spider goes to get_p_token and, after generating the token, requests the URL that you were requesting before. If no CAPTCHA is found, it carries on as normal.

Upvotes: 0
