Reputation: 2152
I have a simple script that scrapes data from Amazon. As you all know, Amazon sometimes serves a captcha, and when it does the page title is 'Robot Check'. I have written logic for this situation: if the page title is 'Robot Check', print a message like 'page cannot be scraped, there is a captcha on the page' and do not get data from this page; otherwise continue the script.
But in the if branch I tried yield scrapy.Request(response.url, callback=self.parse)
to re-request the current URL, with no success. All I need to do is re-request response.url
and continue the script as it is. I think I have to make Scrapy forget that the URL was already scraped, i.e. fool Scrapy into requesting the same URL again. Or maybe there is a way to mark response.url
as a failed URL so that Scrapy automatically re-requests it.
Here is the simple script. start_urls
is in a separate file named urls in the same folder, so I import it from the urls file:
import scrapy
import re
from urls import start_urls

class AmazondataSpider(scrapy.Spider):
    name = 'amazondata'
    allowed_domains = ['www.amazon.co.uk']

    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        try:
            if 'Robot Check' == str(response.xpath('//title/text()').extract_first().encode('utf-8')):
                print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n'
                print 'URL : ', response.url, '\n\n'
                yield scrapy.Request(response.url, callback=self.parse)
            else:
                print '\n\nThere is data on this page, no robot check or captcha\n\n'
                pgtitle = response.xpath('//title/text()').extract_first().encode('utf-8')
                print '\n\n\nhello', pgtitle, '\n\n\n'
                # LOGIC FOR GET DATA BY XPATH on RESPONSE
        except Exception as e:
            print '\n\n\n\n', e, '\n\n\n\n\n'
Upvotes: 2
Views: 1340
Reputation: 21271
Tell Scrapy not to filter out duplicate links: by default, Scrapy does not visit a link again if it has already visited it and received a 200 http_status.
In your case:
print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n'
print 'URL : ',response.url,'\n\n'
yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)
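One caveat: since dont_filter=True bypasses the duplicate filter for that request, a page that keeps serving the captcha would be re-requested forever. A minimal sketch of capping the retries via the request meta dict; MAX_CAPTCHA_RETRIES and the 'captcha_retries' key are illustrative names, not Scrapy API, and the plain dict below stands in for response.meta:

```python
# Sketch: cap how many times a captcha page is re-requested.
# The 'captcha_retries' meta key is a made-up convention for this example.

MAX_CAPTCHA_RETRIES = 3

def should_retry(meta):
    """Decide whether to re-request, and build the meta for the retry."""
    retries = meta.get('captcha_retries', 0)
    if retries >= MAX_CAPTCHA_RETRIES:
        return False, meta  # give up on this URL
    # Copy the meta with an incremented retry counter for the new request.
    return True, dict(meta, captcha_retries=retries + 1)
```

In parse() you would then do something like retry, meta = should_retry(response.meta) and, if retry is true, yield scrapy.Request(response.url, callback=self.parse, dont_filter=True, meta=meta); otherwise log the URL and move on.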
Upvotes: 5