Reputation: 5121
I am trying to scrape multiple webpages using scrapy. The links of the pages look like:
http://www.example.com/id=some-number
On the next page, the number at the end is reduced by 1.
So I am trying to build a spider which navigates to the other pages and scrapes them too. The code that I have is given below:
import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500

class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range(starting_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
This is running into an infinite loop, and when I print the page number I get negative numbers. I think that is happening because I am requesting a page from within my parse() function.
But then the example given here works okay. Where am I going wrong?
Upvotes: 1
Views: 3597
Reputation: 20748
The first page requested is "http://www.example.com/id=1000" (starting_number). Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997 ... http://www.example.com/id=500.

self.page_number is a spider attribute, so when you decrement its value, you have self.page_number == 500 after the first parse().

So when Scrapy calls parse for the response of http://www.example.com/id=999, you are generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497 ... http://www.example.com/id=0.

You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2 ... http://www.example.com/id=-500.

For each response, you are generating 500 requests.

You can stop the loop by testing self.page_number >= 0.
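As a rough sketch of that check (simplified to schedule just one follow-up request per parsed page instead of the loop in your code; the spider name and attribute handling here are only illustrative):

import scrapy
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000

class GuardedSpider(scrapy.Spider):
    name = "guarded"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    page_number = starting_number

    def parse(self, response):
        # ... extract your items from `response` here ...

        # the stopping condition described above: never request a negative ID
        self.page_number -= 1
        if self.page_number >= 0:
            yield Request(url=URL % self.page_number, callback=self.parse)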
Edit after OP question in comments:
No need for multiple threads: Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances from the parse method).
Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.
See the start_requests documentation.
Something like this would work:
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        **parsing data from the webpage**
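If you want to try it outside the scrapy crawl command, a minimal way to run the spider from a plain script (assuming FinalSpider, URL, starting_number and number_of_pages are all defined in the same module, as above) could look like this:

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # run sketch only, not part of the spider itself
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(FinalSpider)  # schedule the spider
    process.start()             # blocks until all queued requests are crawled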
Upvotes: 4