Reputation: 21
I am trying to crawl a website that has pagination. If I click the "next" button at the bottom of the page, new items are generated. My Scrapy program is not able to fetch this dynamically loaded data. Is there a way I can fetch it?
The HTML of the next button looks like this:
<div id="morePaginationID">
<a href="javascript:void(0);" onclick="lazyPagingNew('db')"></a>
and my spider is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

from example.items import ExampleItem  # adjust to your project's items module


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/beauty/90?utm_source=viewallbea"]

    # Follow links found inside the pagination div.
    rules = (
        Rule(SgmlLinkExtractor(allow=('.*',),
                               restrict_xpaths=('//div[@id="morePaginationID"]',)),
             callback="parse_zero", follow=True),
    )

    def parse_zero(self, response):
        hxs = HtmlXPathSelector(response)
        paths = hxs.select('//div[@id="containerDiv"]/div[@id="loadFilterResults"]'
                           '/ul[@id="categoryPageListing"]/li')
        for path in paths:
            item = ExampleItem()
            item["dealUrl"] = path.select("figure/figcaption/a/@href").extract()[0]
            # Follow each deal link and pass the partially filled item along.
            yield Request(item["dealUrl"], callback=self.parselevelone,
                          meta={"item": item})

    def parselevelone(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.meta["item"]
        item["Title2"] = hxs.select('//div[@class="fullDetail"]/div/figure'
                                    '/figcaption/h2/text()').extract()[0]
        return item
Upvotes: 2
Views: 4548
Reputation: 489
There are two ways you can go. First, you can capture the HTTP request the page makes (for example in your browser's network panel) to find the JSON or XML origin URL, and then crawl that URL directly. Second, you can use a spider that can execute JavaScript, such as the pyspider project: https://github.com/binux/pyspider
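For the second option, a pyspider handler can render the page with PhantomJS before parsing. Below is a minimal sketch, assuming the listing markup shown in the question; the start URL, selectors, and result keys are illustrative and may need adjusting:

from pyspider.libs.base_handler import BaseHandler


class Handler(BaseHandler):

    def on_start(self):
        # fetch_type='js' asks pyspider's PhantomJS fetcher to render the page,
        # so the items appended by lazyPagingNew() are present in the DOM.
        self.crawl('http://example.com/beauty/90?utm_source=viewallbea',
                   callback=self.index_page, fetch_type='js')

    def index_page(self, response):
        # response.doc is a PyQuery object over the rendered HTML;
        # the selector mirrors the listing markup shown in the question.
        for link in response.doc('ul#categoryPageListing li figure figcaption a').items():
            self.crawl(link.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            'url': response.url,
            'title': response.doc('div.fullDetail h2').text(),
        }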
Upvotes: 0
Reputation: 750
What you need to do is this:
1) Open Firefox.
2) Run the Firebug console.
3) Go to the search results page.
4) Since the results change dynamically instead of loading a new page, some JavaScript code is calling another URL (an API) for the next page of results.
5) Look in the Firebug console for this URL.
6) You need to set Scrapy to call the same URL that the JavaScript function is calling. It will most probably return a JSON- or XML-formatted array of results, which is easy to manipulate in Python.
7) Most likely it will have a 'pageNo' parameter, so iterate through the page numbers and fetch the results (see the sketch after this list).
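As a rough illustration of steps 6 and 7, here is a minimal sketch, assuming the call turns out to be something like http://example.com/api/deals?pageNo=N returning JSON with an "items" list; that endpoint, the parameter name, and the field names are placeholders you would replace with whatever Firebug actually shows:

import json

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.item import Item, Field


class DealItem(Item):
    dealUrl = Field()
    Title2 = Field()


class DealsApiSpider(BaseSpider):
    name = "deals_api"
    allowed_domains = ["example.com"]

    # Hypothetical endpoint discovered via Firebug; replace it with the URL
    # that the site's lazyPagingNew() call actually requests.
    api_url = "http://example.com/api/deals?pageNo=%d"

    def start_requests(self):
        # Step 7: iterate through the page numbers (first 10 pages here).
        for page in range(1, 11):
            yield Request(self.api_url % page, callback=self.parse_page)

    def parse_page(self, response):
        # Assumes the endpoint returns {"items": [{"url": ..., "title": ...}, ...]}.
        data = json.loads(response.body)
        for entry in data.get("items", []):
            item = DealItem()
            item["dealUrl"] = entry.get("url")
            item["Title2"] = entry.get("title")
            yield item

If the endpoint returns XML instead of JSON, you can parse it with XmlXPathSelector in the same callback.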
Upvotes: 3