user2846182

Reputation: 21

How to crawl a website that has pagination using Scrapy?

I am trying to crawl a website that has pagination. When I click the "next" button at the bottom of the page, new items are loaded dynamically. My Scrapy program is not able to fetch this dynamic data. Is there a way I can fetch it?

The HTML of the "next" button looks like this:

<div id="morePaginationID">

    <a href="javascript:void(0);" onclick="lazyPagingNew('db')"></a>

</div>

and my spider is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

from example.items import ExampleItem  # the project's item class

class ExampleSpider(CrawlSpider):

    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/beauty/90?utm_source=viewallbea"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',), restrict_xpaths=('//div[@id="morePaginationID"]',)),
             callback="parse_zero", follow=True),
    )

    def parse_zero(self, response):
        hxs = HtmlXPathSelector(response)
        paths = hxs.select('//div[@id="containerDiv"]/div[@id="loadFilterResults"]/ul[@id="categoryPageListing"]/li')
        for path in paths:
            item = ExampleItem()
            item["dealUrl"] = path.select("figure/figcaption/a/@href").extract()[0]
            # Follow each item's detail page, carrying the item along in meta
            yield Request(str(item["dealUrl"]), callback=self.parselevelone, meta={"item": item})

    def parselevelone(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta["item"]
        item["Title2"] = hxs.select('//div[@class="fullDetail"]/div/figure/figcaption/h2/text()').extract()[0]
        return item

Upvotes: 2

Views: 4548

Answers (2)

lazybios

Reputation: 489

There are two ways you can go. First, you can capture the HTTP request the page makes to find the origin address of the JSON or XML data, then crawl that address directly. Second, you may need a spider that can execute JavaScript, such as the pyspider project: https://github.com/binux/pyspider
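A minimal sketch of the first approach: suppose the captured request returns a JSON payload like the one below (the field names `deals`, `url`, `title`, and `hasMore` are assumptions for illustration, not the site's real API). The items can then be extracted directly, without rendering any JavaScript:

```python
import json

# Hypothetical JSON payload, shaped like what the "next" button's request
# might return (all field names here are assumptions for illustration)
sample_payload = """
{
  "page": 1,
  "hasMore": true,
  "deals": [
    {"url": "http://example.com/deal/1", "title": "Deal one"},
    {"url": "http://example.com/deal/2", "title": "Deal two"}
  ]
}
"""

def parse_deals(payload):
    """Extract item dicts from one page of the captured JSON response."""
    data = json.loads(payload)
    return [{"dealUrl": d["url"], "Title2": d["title"]} for d in data["deals"]]

items = parse_deals(sample_payload)
```

Once the real endpoint and field names are known from the captured request, a callback like this replaces the XPath scraping entirely.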

Upvotes: 0

zenCoder

Reputation: 750

What you need to do is this:

1) Open Firefox

2) Open the Firebug console

3) Go to the search results page

4) Since the results change dynamically without navigating to another page, JavaScript code is calling another URL (an API) for the next page of results

5) Watch the Firebug console for THIS url

6) Set Scrapy to call the same URL that the JavaScript function is calling. It will most probably return the results as JSON or XML, which is easy to manipulate in Python

7) It will most likely take a 'pageNo' variable, so iterate through the page numbers and fetch the results!
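Step 7 could be sketched like this, assuming the URL found in Firebug takes a 'pageNo' query parameter (both the base URL and the parameter name below are placeholders; use exactly what the JavaScript calls):

```python
# Build one URL per page number; feed these to Scrapy as start_urls,
# or yield a Request for each one from inside the spider.
# The base URL and the "pageNo" parameter name are placeholders.
def build_page_urls(base_url, last_page):
    return ["%s?pageNo=%d" % (base_url, n) for n in range(1, last_page + 1)]

urls = build_page_urls("http://example.com/api/beauty", 3)
```

Each URL then returns one page of results, so there is no JavaScript left to execute.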

Upvotes: 3
