Reputation: 2152
I have a simple script that scrapes data from Amazon. As you all know, Amazon sometimes serves a captcha, and when it does the page title is 'Robot Check'. I have written logic for this situation: if the page title is 'Robot Check', print a message like 'page cannot be scraped, there is a captcha on the page' and do not get data from this page; otherwise continue the script.
But in the if branch I tried yield scrapy.Request(response.url, callback=self.parse)
to re-request the current URL, with no success. All I need to do is re-request response.url
and continue the script as it is. I think I have to make Scrapy forget that the URL was already scraped, i.e. fool Scrapy into requesting the same URL again. Or maybe there is a way to mark response.url
as a failed URL so that Scrapy automatically re-requests it.
Here is the simple script. start_urls
is in a separate file named urls in the same folder, so I import it from the urls file:
import scrapy
import re
from urls import start_urls

class AmazondataSpider(scrapy.Spider):
    name = 'amazondata'
    allowed_domains = ['www.amazon.co.uk']

    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        try:
            if 'Robot Check' == str(response.xpath('//title/text()').extract_first().encode('utf-8')):
                print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n'
                print 'URL : ', response.url, '\n\n'
                yield scrapy.Request(response.url, callback=self.parse)
            else:
                print '\n\nThere is data on this page, no robot check or captcha\n\n'
                pgtitle = response.xpath('//title/text()').extract_first().encode('utf-8')
                print '\n\n\nhello', pgtitle, '\n\n\n'
                # LOGIC FOR GET DATA BY XPATH on RESPONSE
        except Exception as e:
            print '\n\n\n\n', e, '\n\n\n\n\n'
Upvotes: 2
Views: 1340
Reputation: 21271
Tell Scrapy not to filter out duplicate links: by default, Scrapy does not visit a link again if it has already visited it and received a 200 http_status.
In your case:
print '\n\n\n The ROBOT CHeCK Page This link is reopening......\n\n\n'
print 'URL : ',response.url,'\n\n'
yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)
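One caveat: since dont_filter=True bypasses the duplicate filter for that request, a page that keeps serving the captcha would be re-requested forever. A minimal sketch of capping the retries via the request meta dict; MAX_CAPTCHA_RETRIES and the 'captcha_retries' key are illustrative names, not Scrapy API, and the plain dict below stands in for response.meta:

```python
# Sketch: cap how many times a captcha page is re-requested.
# The 'captcha_retries' meta key is a made-up convention for this example.

MAX_CAPTCHA_RETRIES = 3

def should_retry(meta):
    """Decide whether to re-request, and build the meta for the retry."""
    retries = meta.get('captcha_retries', 0)
    if retries >= MAX_CAPTCHA_RETRIES:
        return False, meta  # give up on this URL
    # Copy the meta with an incremented retry counter for the new request.
    return True, dict(meta, captcha_retries=retries + 1)
```

In parse() you would then do something like retry, meta = should_retry(response.meta) and, if retry is true, yield scrapy.Request(response.url, callback=self.parse, dont_filter=True, meta=meta); otherwise log the URL and move on.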
Upvotes: 5