praxmon

Reputation: 5121

Scraping many pages using scrapy

I am trying to scrape multiple webpages using Scrapy. The links of the pages look like this:

http://www.example.com/id=some-number

On each subsequent page, the number at the end is reduced by 1.

So I am trying to build a spider which navigates to the other pages and scrapes them too. The code that I have is given below:

import scrapy
import requests
from scrapy.http import Request

URL = "http://www.example.com/id=%d"
starting_number = 1000
number_of_pages = 500
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def start_request(self):
        for i in range (starting_number, number_of_pages, -1):
            yield Request(url = URL % i, callback = self.parse)

    def parse(self, response):
        # ... parse data from the webpage ...
        pass

This is running into an infinite loop: when I print the page number I get negative numbers. I think that is happening because I am requesting a page from within my parse() function.

But then the example given here works okay. Where am I going wrong?

Upvotes: 1

Views: 3597

Answers (1)

paul trmbrth

Reputation: 20748

The first page requested is http://www.example.com/id=1000 (starting_number).

Its response goes through parse(), and with for i in range(0, 500): you are requesting http://www.example.com/id=999, http://www.example.com/id=998, http://www.example.com/id=997 ... http://www.example.com/id=500.

self.page_number is a spider attribute, so when you're decrementing its value, you have self.page_number == 500 after the first parse().

So when Scrapy calls parse for the response of http://www.example.com/id=999, you're generating requests for http://www.example.com/id=499, http://www.example.com/id=498, http://www.example.com/id=497 ... http://www.example.com/id=0.

You can guess what happens the 3rd time: http://www.example.com/id=-1, http://www.example.com/id=-2 ... http://www.example.com/id=-500.

For each response, you're generating 500 requests.

You can stop the loop by testing for self.page_number >= 0.
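The OP's parse() isn't shown in the question as posted, but based on the description above (a range(0, 500) loop that decrements self.page_number and yields a request per iteration), a minimal sketch of where that test could go:

    def parse(self, response):
        # ... extract data from the page here ...
        for i in range(0, 500):
            self.page_number -= 1
            # only follow pages while the counter is still a valid ID
            if self.page_number >= 0:
                yield Request(url=URL % self.page_number, callback=self.parse)

Requests for IDs that were already visited would be dropped by Scrapy's default duplicate filter anyway, but the start_requests() approach below avoids generating them in the first place.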


Edit after OP question in comments:

No need for multiple threads: Scrapy works asynchronously, and you can enqueue all your requests in an overridden start_requests() method (instead of requesting 1 page and then returning Request instances in the parse method). Scrapy will take enough requests to fill its pipeline, parse the pages, pick new requests to send, and so on.

See the start_requests() documentation.

Something like this would work:

import scrapy
from scrapy.http import Request

# URL, starting_number and number_of_pages are the module-level
# constants defined in the question's code above
class FinalSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ['example.com']
    start_urls = [URL % starting_number]

    def __init__(self, *args, **kwargs):
        super(FinalSpider, self).__init__(*args, **kwargs)
        self.page_number = starting_number

    def start_requests(self):
        # generate page IDs from 1000 down to 501
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i, callback=self.parse)

    def parse(self, response):
        # ... parse data from the webpage ...
        pass
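Neither snippet shows the actual extraction, so purely as a hypothetical illustration (the selector and field names below are placeholders, not taken from the OP's pages), parse() could yield items like this:

    def parse(self, response):
        # placeholder extraction: the selector and field names are assumptions
        yield {
            'id': response.url.rsplit('=', 1)[-1],                  # numeric ID from the URL
            'title': response.css('title::text').extract_first(),   # page <title> text
        }

Run inside a Scrapy project with scrapy crawl final -o pages.json to write the scraped items to a file.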

Upvotes: 4
