sulav_lfc

Reputation: 782

scrapy selenium pagination

I'm trying to scrape TripAdvisor's website. I've used two approaches: the first uses CrawlSpiders and Rules. Not satisfied with the result, I'm now trying to use Selenium to go through each link. The only remaining problem is pagination. I want the Selenium browser to open the webpage, go through each link in the start URL, and then click the next-page button at the bottom. So far I've only written code to extract the required content:

    # requires: import re, time
    # and: from selenium.common.exceptions import NoSuchElementException
    self.driver.get(response.url)
    div_val = self.driver.find_elements_by_xpath('//div[@class="tab_contents"]')
    for link in div_val:
        l = link.find_element_by_tag_name('a').get_attribute('href')
        if re.match(r'http:\/\/www\.tripadvisor\.com\/Hotels\-g[\d]*\-Dominican\_Republic\-Hotels\.html', l):
            link.click()
            time.sleep(5)

            try:
                # collect the link to each hotel's detail page
                hotel_links = self.driver.find_elements_by_xpath('//div[@class="listing_title"]')
                for hotel_link in hotel_links:
                    lnk = hotel_link.find_element_by_class_name('property_title').get_attribute('href')

            except NoSuchElementException:
                print 'element not found'

I'm now stuck on the pagination with Selenium.
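What I have in mind is roughly the loop below. It's only a sketch: the XPath for the next-page link is a guess and would need to be checked against the live page.

    # Sketch only: keep clicking "next" until the link disappears.
    # The XPath for the next-page link is a guess and may need adjusting.
    from selenium.common.exceptions import NoSuchElementException
    import time

    while True:
        # ... collect the hotel links on the current results page here ...
        try:
            next_button = self.driver.find_element_by_xpath('//a[contains(@class, "next")]')
        except NoSuchElementException:
            break  # no "next" link on the last page, so stop paginating
        next_button.click()
        time.sleep(5)  # crude wait for the next page of listings to load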

Upvotes: 1

Views: 897

Answers (1)

John Dene

Reputation: 570

I think a mix of CrawlSpider and Selenium will work for you -

    for click in range(0, 15):  # click the next button for pagination
        button = self.driver.find_element_by_xpath("/html/body/div[3]/div[7]/div[2]/div[7]/div[2]/div[1]/div[3]/div[2]/div/div/div[41]/div[2]/div/a")
        button.click()
        time.sleep(10)

        # range depends upon the number of listings, you can change it;
        # this follows each individual url found in the response
        for i in range(0, 10):
            item['url'] = response.xpath('a[contains(@class, "property_title")]/@href').extract()[i]
            if item['url']:
                if 'http://' not in item['url']:
                    item['url'] = urljoin(response.url, item['url'])
                yield scrapy.Request(item['url'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

    def anchor_page(self, response):
        old_item = response.request.meta['item']
        # extract the data you want to scrape into old_item here
        yield old_item
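One caveat: `response.xpath` above reads the original Scrapy response, not the page Selenium has clicked through to. A minimal sketch of one way around that, assuming the spider keeps the browser on `self.driver` and reusing the same `item` and `anchor_page` callback as above, is to rebuild a Scrapy `Selector` from `driver.page_source` after each click:

    # Sketch only: re-parse the HTML Selenium is currently showing,
    # instead of the original Scrapy response.
    from scrapy import Selector
    from urlparse import urljoin  # Python 2; use urllib.parse on Python 3

    sel = Selector(text=self.driver.page_source)
    for href in sel.xpath('//a[contains(@class, "property_title")]/@href').extract():
        item['url'] = urljoin(response.url, href)
        yield scrapy.Request(item['url'],
                             meta={'item': item},
                             callback=self.anchor_page)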

Upvotes: 1
