Reputation: 952
This is a follow-up to my earlier question about scraping web pages.
My earlier question: Pin down exact content location in html for web scraping urllib2 Beautiful Soup
This question is about doing the same thing, but repeatedly across multiple pages/views.
Here is my code:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')
for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating
As you can see from the URL, it does not change when you navigate to the second page; if it did, this would not be an issue. Here, clicking the next-page link triggers JavaScript that loads the next set of reviews from the server. Is there a way to still scrape this using Selenium in Python with only a slight modification of the code above? Please let me know if there is.
Thanks.
Upvotes: 1
Views: 2256
Reputation: 30136
Just click Next after reading each page:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')
while True:
    # Read every review on the current page.
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        title = review.find_element_by_class_name('BVRRReviewTitle').text
        rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
        print title, rating
    # Move to the next page; stop once there is no "Next" link left.
    try:
        driver.find_element_by_link_text('Next').click()
    except NoSuchElementException:
        break
driver.quit()
Or if you want to limit the number of pages that you are reading:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')
maxNumOfPages = 10  # for example
# pageId is the number of the page to open after reading the current one.
for pageId in range(2, maxNumOfPages + 2):
    for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
        title = review.find_element_by_class_name('BVRRReviewTitle').text
        rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
        print title, rating
    # Click the link labelled with the next page number; stop if it is missing.
    try:
        driver.find_element_by_link_text(str(pageId)).click()
    except NoSuchElementException:
        break
driver.quit()
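Both loops can also be folded into one function that returns the reviews instead of printing them. The following is only a sketch under a few assumptions: `scrape_reviews`, `max_pages`, and `pause` are names made up for this example, the fixed `time.sleep` is a crude stand-in for waiting until the JavaScript has loaded the next page, and the finder methods are the Selenium 3 style ones used in this thread (Selenium 4 replaced them with `find_elements(By.CLASS_NAME, ...)` and similar). It uses the plural `find_elements_by_link_text`, which returns an empty list rather than raising when the link is missing.

```python
import time


def scrape_reviews(driver, max_pages=None, pause=1.0):
    """Collect (title, rating) pairs, following the "Next" link page by page.

    driver    -- a WebDriver (or any object with the same finder methods)
    max_pages -- hypothetical limit on pages to read; None means no limit
    pause     -- seconds to sleep after each click (assumed to be enough
                 for the JavaScript-loaded page to appear)
    """
    results = []
    page = 1
    while True:
        # Read every review on the current page.
        for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
            title = review.find_element_by_class_name('BVRRReviewTitle').text
            rating = review.find_element_by_xpath(
                './/div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
            results.append((title, rating))
        # Stop before clicking if the page limit has been reached.
        if max_pages is not None and page >= max_pages:
            break
        # An empty list means there is no "Next" link, i.e. the last page.
        next_links = driver.find_elements_by_link_text('Next')
        if not next_links:
            break
        next_links[0].click()
        time.sleep(pause)
        page += 1
    return results
```

With the driver opened as above, `reviews = scrape_reviews(driver, max_pages=10)` would give a list of tuples to print or store. If the fixed pause turns out to be unreliable, an explicit wait is probably the better choice.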
Upvotes: 2
Reputation: 9019
I think this would work. The Python might be a little off, but it should give you a starting point (it reuses the driver from your code):
keep_going = True  # "continue" is a reserved word in Python, so it cannot be a variable name
while keep_going:
    try:
        for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
            title = review.find_element_by_class_name('BVRRReviewTitle').text
            rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
            print title, rating
        # Click the next-page control; this raises once there is no next page.
        driver.find_element_by_name('BV_TrackingTag_Review_Display_NextPage').click()
    except Exception:
        print "Done!"
        keep_going = False
Upvotes: 1