Reputation: 2837
I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page and writes them to a CSV file. I then wrote another crawler that loops over those links and downloads some information from the pages they point to.
The loop over the links:
import csv

# read the links written by the first crawler and turn them into start URLs;
# row[0][1:] drops the leading character of the stored path
cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate" + row[0][1:])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2 ("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=("&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
This seemed to work fine at first, but the crawler only retrieves information from the pages with the extended URLs (those matching "&pageinfo=\d"), and not from the original pages without it. How can I fix that?
Thank you!
Upvotes: 0
Views: 96
Reputation: 9185
Your rule only allows URLs containing "&pageinfo=\d", so only pages whose URL matches that pattern are processed. You need to change the allow parameter so that the URLs without pageinfo are processed as well.
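For example, a minimal sketch of what a relaxed pattern could look like (the exact regex is an assumption, and restrict_xpaths may need loosening too so that the plain model links are actually extracted):

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# match model pages both with and without a trailing &pageinfo=N
rules = (
    Rule(SgmlLinkExtractor(allow=(r"ratemodel\.aspx\?modelid=\d+",)),
         callback="parse_items", follow=True),
)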
Upvotes: 0
Reputation: 161704
You can override the parse_start_url() method of CrawlSpider:
class MySpider(CrawlSpider):

    def parse_items(self, response):
        # put your code here
        ...

    # responses for the start_urls are handled by parse_start_url(), which
    # does nothing by default; point it at parse_items so they are parsed too
    parse_start_url = parse_items
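Putting it together, a minimal sketch of the whole spider could look like this (the spider name, class name and allowed_domains are assumptions):

import csv

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# build the start URLs from the csv written by the first crawler
with open("linksToCrawl.csv", "rb") as f:
    START_URLS = ["http://www.zap.co.il/rate" + row[0][1:]
                  for row in csv.reader(f)]

class ZapRatingsSpider(CrawlSpider):
    name = "zap_ratings"                  # assumed spider name
    allowed_domains = ["zap.co.il"]
    start_urls = START_URLS

    rules = (
        Rule(SgmlLinkExtractor(allow=("&pageinfo=\d",),
                               restrict_xpaths=('//a[@class="NumBtn"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # extract the rating information from the page here
        ...

    # without this alias, only the "&pageinfo=" pages reached through the
    # rules are parsed; with it, the start pages go through parse_items too
    parse_start_url = parse_items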
Upvotes: 2