Reputation: 23
I am trying to extract the reviews, with their respective titles, of a particular hotel on TripAdvisor, using web-scraping techniques with Python and Selenium. So far I have only been able to extract the reviews from a single page, and I need to extract all (or at least most) of them, but the pagination is not working. What I do is click the Next button, iterate over a range of pages, and extract the information.
The scraping starts from the home page https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html. Notice that when the page changes, an offset token is added to the URL: -or20- for the third page, -or30- for the fourth, -or40- for the fifth, and so on. Each page always shows 10 reviews.
For example, this is the third page: https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-or20-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html
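The -orN- offset pattern described above can be sketched as a small URL builder (the hotel URL is the real one from the question; treating 10 reviews per page as given, while the total number of pages is something you would still need to determine):

```python
# Build TripAdvisor review-page URLs directly from the -orN- offset pattern,
# as an alternative to clicking the Next button.
BASE = ("https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews"
        "{offset}-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html")

def page_url(page_index):
    """Page 0 has no offset token; page n (n >= 1) inserts '-or{10*n}'."""
    offset = "" if page_index == 0 else f"-or{10 * page_index}"
    return BASE.format(offset=offset)

# First three pages of reviews
urls = [page_url(i) for i in range(3)]
for u in urls:
    print(u)
```

With this approach each page can be loaded with `driver.get(page_url(i))` instead of relying on the Next button being clickable.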
Basically this is what I do: open a CSV file, open the page, switch to all languages (this is optional), expand the reviews, read the reviews, click the Next button, write to the CSV, and iterate over a range of pages.
Any help is appreciated, thanks in advance!
Images: Trip Advisor HTML reviews
This is my code so far:
# Web scraping
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Write the CSV header
with open('reviews_hotel_9.csv', 'w', encoding="utf-8") as file:
    file.write('titles, reviews \n')

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html")
sleep(3)

# Accept cookies
cookie = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
try:
    cookie.click()
except:
    pass
print('ok')

for k in range(10):  # range pagination
    #container = driver.find_elements_by_xpath("//div[@data-reviewid]")
    try:
        # radio button all languages (optional)
        #driver.execute_script("arguments[0].click();", WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="component_14"]/div/div[3]/div[1]/div[1]/div[4]/ul/li[1]/label/span[1]'))))
        # "Read more": expand the reviews
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//span[@class="Ignyf _S Z"]'))))
        # review titles
        titles = driver.find_elements_by_xpath('//div[@class="KgQgP MC _S b S6 H5 _a"]/a/span')
        sleep(1)
        # review bodies
        reviews = driver.find_elements_by_xpath('//q[@class="QewHA H4 _a"]/span')
        sleep(1)
    except TimeoutException:
        pass
    with open('reviews_hotel_9.csv', 'a', encoding="utf-8") as file:
        for i in range(len(titles)):
            file.write(titles[i].text + ";" + reviews[i].text + "\n")
    try:
        #driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//a[@class="ui_button nav next primary "]')))) #
        # click on Next button (this is the part that is not working)
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, 'ui_button nav next primary '))))
    except TimeoutException:
        pass

driver.quit()
Upvotes: 2
Views: 609
Reputation: 7799
You can scrape the data from each page of reviews by clicking the Next button. In the code below we run an infinite loop and keep clicking the Next button, stopping when we encounter an ElementClickInterceptedException. At the end we save the data to the file MyData.csv.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import ElementClickInterceptedException
import pandas as pd
import time

chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
url = 'https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html'
driver = webdriver.Chrome(service=s)
driver.get(url)

df = pd.DataFrame(columns=['title', 'review'])

while True:
    time.sleep(2)
    # Each review card on the current page
    reviews = driver.find_elements(by=By.CSS_SELECTOR, value='.WAllg._T')
    for review in reviews:
        title = review.find_element(by=By.CSS_SELECTOR, value='.KgQgP.MC._S.b.S6.H5._a').text
        text = review.find_element(by=By.CSS_SELECTOR, value='.fIrGe._T').text
        df.loc[len(df)] = [title, text]
    try:
        # Click the Next button; on the last page the click is intercepted
        driver.find_element(by=By.CSS_SELECTOR, value='.ui_button.nav.next.primary').click()
    except ElementClickInterceptedException:
        break

df.to_csv("MyData.csv")
This gives us the output:
Upvotes: 2