Reputation: 315
I'm trying to recursively scrape data from a Chinese website. I made my spider follow the "next page" URL until no "next page" link is available. Here is my spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from hrb.items_hrb import HrbItem


class HrbSpider(CrawlSpider):
    name = "hrb"
    allowed_domains = ["www.harbin.gov.cn"]
    start_urls = ["http://bxt.harbin.gov.cn/hrb_bzbxt/list_hf.php"]

    rules = (
        # The title "\u4e0b\u4e00\u9875" is Chinese for "next page"
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=(u'//a[@title="\u4e0b\u4e00\u9875"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        items = []
        # Skip the header row of the listing table
        for sel in response.xpath("//table[3]//tr[position() > 1]"):
            item = HrbItem()
            item['id'] = sel.xpath("td[1]/text()").extract()[0]
            title = sel.xpath("td[3]/a/text()").extract()[0]
            item['title'] = title.encode('gbk')
            item['time1'] = sel.xpath("td[3]/text()").extract()[0][2:12]
            item['time2'] = sel.xpath("td[5]/text()").extract()[1]
            items.append(item)
        return items
The problem is that it only scraped the first 15 pages. I browsed to page 15, and there was still a "next page" button, so why did it stop? Is this intended by the website to prevent scraping, or is there some problem with my code? And if we are only allowed to scrape 15 pages at a time, is there a way to start scraping from a certain page, say, page 16? Many thanks!
Upvotes: 0
Views: 325
Reputation: 873
Joseph,
Try analyzing the URLs of the pages your spider is scraping and comparing them with the URL at which your spider stops.
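For instance, one quick way to see which pages the crawl actually reaches is to log response.url at the top of the callback (a debugging sketch only; it assumes your callback is named parse_items, as in your code):

    def parse_items(self, response):
        # Log each URL the spider reaches so you can spot
        # the exact page where the crawl stops
        self.log("Visited: %s" % response.url)
        # ... rest of parse_items unchanged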
Also try removing www. from the URL in your allowed_domains.
You can also try including something like harbin.gov.cn/hrb_bzbxt/list_hf.php.* in the allow set of the SgmlLinkExtractor.
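Putting the two suggestions together, the relevant parts of the spider might look something like this (just a sketch; the exact allow regex may need adjusting once you have compared the real URLs):

    allowed_domains = ["harbin.gov.cn"]  # no "www." prefix

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'hrb_bzbxt/list_hf\.php.*',),
                               restrict_xpaths=(u'//a[@title="\u4e0b\u4e00\u9875"]',)),
             callback="parse_items", follow=True),
    )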
Hope this helps.
Cheers!!
Upvotes: 1