Cheshie

Reputation: 2837

Looping in Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.

I wrote a crawler that grabs the URLs of certain links on a certain page and writes them to a CSV file. I then wrote another crawler that loops over those links and downloads some information from the pages they point to.

The loop over the links:

import csv

# Build the start URLs from the links collected by the first crawler.
cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    # row[0] holds the stored link; drop its leading character and prepend the site prefix.
    start_urls.append("http://www.zap.co.il/rate" + row[0][1:])

If, for example, the URL of the page I'm retrieving information from is:

http://www.zap.co.il/ratemodel.aspx?modelid=835959

then more information can (sometimes) be retrieved from following pages, like:

http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2 ("&pageinfo=2" was added).

Therefore, my rules are:

rules = (
    Rule(SgmlLinkExtractor(allow=(r"&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)

This seemed to work at first, but the crawler only retrieves information from the pages with the extended URLs (those matching "&pageinfo=\d"), and not from the ones without them. How can I fix that?

Thank you!

Upvotes: 0

Views: 96

Answers (2)

Biswanath

Reputation: 9185

Your rule only allows URLs containing "&pageinfo=\d", so only pages whose URL matches that pattern are processed. You need to change the allow parameter so that URLs without pageinfo are processed as well.
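
For illustration, a broadened pattern might look something like the sketch below. The exact regex is an assumption about the site's URL scheme, and I've left out the restrict_xpaths constraint since the plain model pages may not be linked from the "NumBtn" buttons; adjust both to the pages you actually want crawled.

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Sketch only: match both the plain model pages and the paginated "&pageinfo=N" pages.
rules = (
    Rule(SgmlLinkExtractor(allow=(r"ratemodel\.aspx\?modelid=\d+(&pageinfo=\d+)?",)),
         callback="parse_items", follow=True),
)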

Upvotes: 0

kev

Reputation: 161704

You can override the parse_start_url() method in CrawlSpider:

class MySpider(CrawlSpider):

    def parse_items(self, response):
        # put your code here
        ...

    # Also run parse_items on the responses for the start URLs themselves.
    parse_start_url = parse_items
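
By default, CrawlSpider's parse_start_url() does nothing (it returns an empty list), so the responses for the start URLs are only used for link extraction and never reach your callback. Aliasing it to parse_items makes the start pages themselves get parsed, while the "&pageinfo" pages are still handled by the rule.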

Upvotes: 2
