Reputation: 863
The answer of this question was quite difficult to find since informations are scattered, and the title of the questions are sometime misleading. The answer below regroup all informations needed in one place.
Upvotes: 0
Views: 1380
Reputation: 863
Your spider should look like.
# based on https://doc.scrapy.org/en/latest/intro/tutorial.html
import scrapy
import requests
class QuotesSpider(scrapy.Spider):
name = "quotes"
def start_requests(self):
urls = [
'http://quotes.toscrape.com/page/1/',
'http://quotes.toscrape.com/page/2/',
]
for url in urls:
print('\n\nurl:', url)
## use one of the yield below
# middleware will process the request
yield scrapy.Request(url=url, callback=self.parse)
# check if Tor has changed IP
#yield scrapy.Request('http://icanhazip.com/', callback=self.is_tor_and_privoxy_used)
def parse(self, response):
page = response.url.split("/")[-2]
filename = 'quotes-%s.html' % page
with open(filename, 'wb') as f:
f.write(response.body)
print('\n\nSpider: Start')
print('Is proxy in response.meta?: ', response.meta)
print ("user_agent is: ",response.request.headers['User-Agent'])
print('\n\n Spider: End')
self.log('Saved file --- %s' % filename)
def is_tor_and_privoxy_used(self, response):
print('\n\nSpider: Start')
print("My IP is : " + str(response.body))
print("Is proxy in response.meta?: ", response.meta) # not header dispo
print('\n\nSpider: End')
self.log('Saved file %s' % filename)
You will also need to add stuff in middleware.py and settings.py . If you don't know how to do it this will help you
Upvotes: 0