Reputation: 782
I'm trying to scrape TripAdvisor's website. I used two approaches: the first was a CrawlSpider with Rules. Not satisfied with the results, I'm now trying to use Selenium to go through each link. The only remaining problem is pagination. I want the Selenium browser to open the page, go through each link in the start URL, and then click the next-page button at the bottom. So far I've only written code to extract the required content:
import re
import time
from selenium.common.exceptions import NoSuchElementException

self.driver.get(response.url)
div_val = self.driver.find_elements_by_xpath('//div[@class="tab_contents"]')
for link in div_val:
    l = link.find_element_by_tag_name('a').get_attribute('href')
    if re.match(r'http:\/\/www\.tripadvisor\.com\/Hotels\-g[\d]*\-Dominican\_Republic\-Hotels\.html', l):
        link.click()
        time.sleep(5)
        try:
            # collect the individual hotel links from the listing page
            hotel_links = self.driver.find_elements_by_xpath('//div[@class="listing_title"]')
            for hotel_link in hotel_links:
                lnk = hotel_link.find_element_by_class_name('property_title').get_attribute('href')
        except NoSuchElementException:
            print 'element not found'
I'm now stuck on the pagination part with Selenium.
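Roughly, what I'm after is a loop along these lines (the XPath for the next button is just a guess on my part; that's the bit I can't work out):

# sketch of the pagination loop I'm trying to write - the selector for the
# "next" link is an assumption, I haven't confirmed it against the live page
while True:
    try:
        next_button = self.driver.find_element_by_xpath('//a[contains(@class, "next")]')
    except NoSuchElementException:
        break  # no more pages
    next_button.click()
    time.sleep(5)
    # ... re-extract the hotel links from the freshly loaded page here ...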
Upvotes: 1
Views: 897
Reputation: 570
I think a mix of CrawlSpider and Selenium will work for you:
import time
import scrapy
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

for click in range(0, 15):  # clicking on next button for pagination
    button = self.driver.find_element_by_xpath("/html/body/div[3]/div[7]/div[2]/div[7]/div[2]/div[1]/div[3]/div[2]/div/div/div[41]/div[2]/div/a")
    button.click()
    time.sleep(10)
    for i in range(0, 10):  # range depends upon number of listings, you can change it
        # entering into the individual url using response
        item['url'] = response.xpath('//a[contains(@class,"property_title")]/@href').extract()[i]
        if item['url']:
            if 'http://' not in item['url']:
                item['url'] = urljoin(response.url, item['url'])
            yield scrapy.Request(item['url'],
                                 meta={'item': item},
                                 callback=self.anchor_page)

def anchor_page(self, response):
    old_item = response.request.meta['item']
    # data you want to scrape
    yield old_item
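One thing to keep in mind: after Selenium clicks the next button, the original Scrapy response still holds the first page's HTML, so response.xpath will not see the newly loaded listings. A minimal sketch of one way around that, assuming you rebuild a Selector from the driver's current page source after each click (the reuse of item and anchor_page here is just to match the code above):

from scrapy.selector import Selector

# after button.click() and the sleep, parse the page Selenium is now on
sel = Selector(text=self.driver.page_source)
for href in sel.xpath('//a[contains(@class,"property_title")]/@href').extract():
    url = urljoin(self.driver.current_url, href)
    yield scrapy.Request(url, meta={'item': dict(item)}, callback=self.anchor_page)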
Upvotes: 1