Ventor

Reputation: 23

How to paginate using Python Selenium in Trip Advisor to extract reviews

I am trying to extract the reviews, with their respective titles, of a particular hotel on Trip Advisor using Python and Selenium. So far I have only been able to extract the reviews from a single page. I need to extract all or most of the reviews, but the pagination is not working. My approach is to iterate over a range of pages, extract the information from each one, and click the Next button.

I start scraping from the hotel's main page: https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html. Notice that when the page changes, an offset is added to the URL: -or20- for the third page, -or30- for the fourth, -or40- for the fifth, and so on. Each page always shows 10 reviews.

For example, this is the third page: https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-or20-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html
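To make the offset pattern concrete, here is a minimal sketch that builds the URL for a given page. The helper name review_page_url and the page_size parameter are illustrative assumptions, and it assumes the offset is always the zero-based page index times 10, which matches the URLs shown here but is not confirmed for every page.

```python
BASE_URL = (
    "https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-"
    "Parador_de_Alcala_de_Henares-Alcala_De_Henares.html"
)


def review_page_url(base_url: str, page_index: int, page_size: int = 10) -> str:
    """Return the URL of a review page (hypothetical helper).

    page_index is zero-based: 0 is the first page, 2 is the third page.
    """
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    if page_index == 0:
        # The first page carries no offset segment.
        return base_url
    offset = page_index * page_size
    # Insert "-orN-" right after the "-Reviews-" segment, once.
    return base_url.replace("-Reviews-", f"-Reviews-or{offset}-", 1)


third_page = review_page_url(BASE_URL, 2)
```

If this pattern holds, each URL could be loaded directly with driver.get instead of clicking the Next button.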

Basically, this is what I do: create the CSV file, open the page, switch to all languages (optional), expand the reviews, read the reviews, write them to the CSV, click the Next button, and repeat over a range of pages.

Any help is appreciated. Thanks in advance!

Images: Trip Advisor HTML reviews

Trip Advisor HTML Next button

This is my code so far:

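# Web scraping
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Writing to csv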
with open('reviews_hotel_9.csv', 'w', encoding="utf-8") as file:
    # Header uses the same ";" delimiter as the rows written below
    file.write('titles;reviews\n')

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html")
sleep(3)
cookie = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')# cookies accept

try:
    cookie.click()
except:
    pass
print('ok')

for k in range(10): #range pagination
    #container = driver.find_elements_by_xpath("//div[@data-reviewid]")
    try:
        # radio button all languages (optional)
        #driver.execute_script("arguments[0].click();", WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="component_14"]/div/div[3]/div[1]/div[1]/div[4]/ul/li[1]/label/span[1]'))))
        # read more expand reviews
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//span[@class="Ignyf _S Z"]')))) 
        # review titles
        titles = driver.find_elements_by_xpath('//div[@class="KgQgP MC _S b S6 H5 _a"]/a/span') #
        sleep(1)
        # reviews
        reviews = driver.find_elements_by_xpath('//q[@class="QewHA H4 _a"]/span')#
        sleep(1)
    except TimeoutException:
        pass
    with open('reviews_hotel_9.csv', 'a', encoding="utf-8") as file:
        for i in range(len(titles)):
            file.write(titles[i].text + ";" + reviews[i].text + "\n")

    try:
        #driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//a[@class="ui_button nav next primary "]')))) # 
        # click on Next button
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, 'ui_button nav next primary ')))) 
    except TimeoutException:
        pass
    file.close()
driver.quit()

Upvotes: 2

Views: 609

Answers (1)

Himanshu Poddar

Reputation: 7799

You can scrape each review page in turn by clicking the Next button on the review page. The code below runs an infinite loop: it reads the reviews on the current page, then clicks Next. The loop stops when the click raises ElementClickInterceptedException. At the end, the data is saved to the file MyData.csv.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.common.exceptions import ElementClickInterceptedException
import pandas as pd
import time

chrome_path = r"C:\Users\hpoddar\Desktop\Tools\chromedriver_win32\chromedriver.exe"
s = Service(chrome_path)
url = 'https://www.tripadvisor.com/Hotel_Review-g562644-d1490165-Reviews-Parador_de_Alcala_de_Henares-Alcala_De_Henares.html'
driver = webdriver.Chrome(service=s)
driver.get(url)

df = pd.DataFrame(columns = ['title', 'review'])
while True:
    time.sleep(2)
    # Each matching element is one review card on the current page
    reviews = driver.find_elements(by=By.CSS_SELECTOR, value='.WAllg._T')
    for review in reviews:
        title = review.find_element(by=By.CSS_SELECTOR, value='.KgQgP.MC._S.b.S6.H5._a').text
        # Use a separate name so the loop variable is not overwritten
        text = review.find_element(by=By.CSS_SELECTOR, value='.fIrGe._T').text
        df.loc[len(df)] = [title, text]
    try:
        driver.find_element(by=By.CSS_SELECTOR, value='.ui_button.nav.next.primary').click()
    except ElementClickInterceptedException:
        break

df.to_csv("MyData.csv")
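The loop above only ends when a click fails, so it may help to see the same pattern with a hard page limit. This is a rough sketch with the Selenium calls factored out into callables; collect_pages, read_page, go_next, stop_on and max_pages are all hypothetical names, not part of Selenium.

```python
from typing import Callable, Iterable, List, Tuple, Type

Row = Tuple[str, str]  # (title, review text)


def collect_pages(
    read_page: Callable[[], Iterable[Row]],
    go_next: Callable[[], None],
    stop_on: Tuple[Type[BaseException], ...],
    max_pages: int = 50,
) -> List[Row]:
    """Read the current page, then advance, until go_next fails or the cap is hit."""
    rows: List[Row] = []
    for _ in range(max_pages):  # hard cap so the loop cannot run forever
        rows.extend(read_page())
        try:
            go_next()
        except stop_on:
            break  # no further page could be reached
    return rows
```

With Selenium, read_page could wrap the find_elements calls, go_next could wrap the Next-button click, and stop_on could be (ElementClickInterceptedException, NoSuchElementException) from selenium.common.exceptions, so that a missing button also ends the loop.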

This gives us the output in MyData.csv, with one row per review and the columns title and review.
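If you prefer to keep writing the file by hand as in the question, note that joining fields with ";" breaks as soon as a review contains a semicolon or a line break. A small sketch using the standard csv module, which quotes such fields, could look like this. The function name write_reviews and the sample row are made up for illustration.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_reviews(rows: Iterable[Tuple[str, str]], out: TextIO) -> None:
    """Write (title, review) rows as ';'-separated CSV with proper quoting."""
    writer = csv.writer(out, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["title", "review"])
    writer.writerows(rows)


# Made-up sample row; the review text contains the delimiter on purpose.
buffer = io.StringIO()
write_reviews([("Great stay", "Lovely rooms; friendly staff")], buffer)
```

When writing to a real file instead of a StringIO buffer, open it with newline="" as the csv module documentation recommends.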

Upvotes: 2
