RandallCloud

Reputation: 133

Trouble when webscraping booking.com

Initially, I wanted to go to this page for each hotel:

subpage

Unfortunately, some kind of JavaScript process opens this subpage, and my script doesn't realize it is there: even with the right URL, it assumes it is still on the main page:

mainpage

Hence, I could not figure out how to scrape all the reviews from this subpage. With the help of one member, we found that the subpage I want loads from this URL: subpagelink

I saw that I just needed to replace the word "hotel" in the main URL with "hotelfeaturedreviews", and then I could easily scrape the reviews -> changeURL

So I made this script :

from selenium import webdriver
import time    
from selenium.webdriver.support.select import Select   
from selenium.webdriver.support.ui import WebDriverWait     
from selenium.webdriver.common.by import By     
from selenium.webdriver.support import expected_conditions as EC   
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np    
from selenium.webdriver.common.keys import Keys

PATH = r"driver\chromedriver.exe" # raw string so the backslash is not treated as an escape

options = webdriver.ChromeOptions() 
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')
      
driver = webdriver.Chrome(options=options, executable_path=PATH)

driver.get('https://www.booking.com/index.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ&sid=303509179a2849df63e4d1e5bc1ab1e3&srpvid=e6ae6d1417bd00a1&click_from_logo=1')
driver.maximize_window()
time.sleep(2)

headers= {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}


try:
    cookie = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
    cookie.click()
except:
    pass # the lookup itself must be inside the try, or a missing banner raises before the click

time.sleep(2)
       
job_title = driver.find_element_by_xpath('//*[@id="ss"]')
job_title.click()
job_title.send_keys('Paris') # enter the city here; mind the spelling
time.sleep(3)

search = driver.find_element_by_xpath('//*[@id="frm"]/div[1]/div[4]/div[2]/button')
search.click()
time.sleep(6)

linksfinal = []

n = 1 

for x in range(n): #iterate over n pages

    time.sleep(3)

    my_elems = driver.find_elements_by_xpath('//a[@class="js-sr-hotel-link hotel_name_link url"]')

    links = [my_elem.get_attribute("href") for my_elem in my_elems]

    links = [link.replace('\n','') for link in links]

    linksfinal = linksfinal + links

    time.sleep(3) 

    next_button = driver.find_element_by_xpath('//*[@class="bk-icon -iconset-navarrow_right bui-pagination__icon"]') # "next" shadows a built-in, so use another name

    next_button.click()

nameshotel = []
for url in linksfinal:
    results = requests.get(url, headers = headers)
    soup = BeautifulSoup(results.text, "html.parser")
    name = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
    nameshotel.append(name)

for i in range(len(linksfinal)) :
    linksfinal[i] = linksfinal[i].replace('hotel','hotelfeaturedreviews')

    
for url, name in zip(linksfinal, nameshotel) :

    commspos = []
    commsneg = []
    header = []
    notes = []
    dates = []
    datestostay = []

    results = requests.get(url, headers = headers)

    soup = BeautifulSoup(results.text, "html.parser")

    reviews = soup.find_all('li', class_ = "review_item clearfix")


    for review in reviews:
        try:
            commpos  = review.find("p", class_  = "review_pos").text.strip()
        except:
            commpos = 'NA'

        commspos.append(commpos)



        try:
            commneg  = review.find("p", class_  = "review_neg").text.strip()
        except:
            commneg = 'NA'

        commsneg.append(commneg)


        head = review.find('div', class_ = 'review_item_header_content').text.strip()
        header.append(head)


        note = review.find('span', class_ = 'review-score-badge').text.strip()
        notes.append(note)


        date = review.find('p', class_ = 'review_item_date').text[23:].strip()
        dates.append(date)


        try:
            datestay = review.find('p', class_ = 'review_staydate').text[20:].strip()
            datestostay.append(datestay)
        except:
            datestostay.append('NaN')


    data = pd.DataFrame({
        'commspos' : commspos,
        'commsneg' : commsneg,
        'headers' : header,
        'notes' : notes,
        'dates' : dates,
        'datestostay' : datestostay,
        })


    data.to_csv(f"{name}.csv", sep=';', index=False, encoding = 'utf_8_sig')
    #data.to_csv(f"{name}" + datetime.now().strftime("_%Y_%m_%d-%I_%M_%S") + ".csv", sep=';', index=False) # needs: from datetime import datetime

    time.sleep(3)

This script scrapes the links of all the hotels I want, stores them in a list, replaces "hotel" with "hotelfeaturedreviews" in each one, and loops over those links to scrape the reviews for each hotel.

Unfortunately, there is no next button on this page, so I cannot scrape all the reviews either. At the end of the page there is nothing like in the subpage from the beginning, so I cannot find how to go to the next page of reviews with this trick.

I'm kind of lost. Do you have any idea how I could overcome this and scrape all the comments I want, with control over the pages so I can go to whatever review page I want?

Sorry for the length; I wanted to be clear, so I included a lot of details.

Upvotes: 0

Views: 529

Answers (2)

Ram

Reputation: 4779

You can do like this:

  • From a Hotel URL extract the following parameters values. You can easily find them out in the URL.
    • cc1 - This is the country code of the Hotel.
    • pagename - Name of the hotel from the URL
    • label
    • sid
    • srpvid
  • Use those values in the following URL. This URL will give you the reviews.

https://www.booking.com/reviewlist.html?label={}&sid={}&cc1={}&pagename={}&srpvid={}&type=total&offset=0&rows=10

  • rows=10 - displays 10 reviews per page. You can change this accordingly.
  • offset=0 - points to the first page; offset=10 points to the second page, and so on.

After using those parameters in the above URL you can scrape that final URL and extract whatever data you need.
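The steps above can be sketched as a small helper that assembles the review-list URL and pages through it via offset. This is a minimal sketch; every value passed in below (YOUR_LABEL, YOUR_SID, etc.) is a placeholder, and you must copy the real ones from your hotel's URL.

```python
# Build the reviewlist.html URL from the parameters extracted from a hotel URL.
# All argument values used below are hypothetical placeholders.
def reviewlist_url(label, sid, cc1, pagename, srpvid, offset=0, rows=10):
    # type=total returns all reviews; offset pages through them rows at a time
    return (
        "https://www.booking.com/reviewlist.html"
        f"?label={label}&sid={sid}&cc1={cc1}&pagename={pagename}"
        f"&srpvid={srpvid}&type=total&offset={offset}&rows={rows}"
    )

# offset=0 is the first page, offset=10 the second, and so on (with rows=10)
page_urls = [reviewlist_url("YOUR_LABEL", "YOUR_SID", "fr",
                            "your-hotel-name", "YOUR_SRPVID", offset=o)
             for o in range(0, 30, 10)]
```

Each of those URLs can then be fetched with `requests.get` and parsed with BeautifulSoup exactly as in the question's review loop.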

Eg:

For this Hotel link, the reviews site is this

Upvotes: 1

marc29

Reputation: 21

Go through your subpagelink and you will find the div that contains the next page link; for example, if you are on the first page, you can find page number 2:

</div>
<div class="bui-pagination__item ">
<a href="/reviewlist.fr.html?aid=304142;label=gen173nr-1DCAEoggI46AdIM1gEaEaIAQGYAQ24AQfIAQzYAQPoAQGIAgGoAgO4AsXYm4cGwAIB0gIkMDJjN2FmZTQtYTg4YS00NDI5LTlhMDYtNDdmM2IyZWE4Y2Q02AIE4AIB;sid=57b6bff3ca5e1cc8a54e98d7d17ed16a;cc1=es;dist=1;length_of_stay=25;pagename=apartamentos-levante-club;srpvid=a99e5664b05c00da;type=total&amp;;offset=10;rows=10"
class="bui-pagination__link"
data-page-number="2"
>

Get the href from that div and prepend the site root, https://www.booking.com, to it.

With this you will get the second page link:

https://www.booking.com/reviewlist.fr.html?aid=304142;label=gen173nr-1DCAEoggI46AdIM1gEaEaIAQGYAQ24AQfIAQzYAQPoAQGIAgGoAgO4AsXYm4cGwAIB0gIkMDJjN2FmZTQtYTg4YS00NDI5LTlhMDYtNDdmM2IyZWE4Y2Q02AIE4AIB;sid=57b6bff3ca5e1cc8a54e98d7d17ed16a;cc1=es;dist=1;length_of_stay=25;pagename=apartamentos-levante-club;srpvid=a99e5664b05c00da;type=total&amp;;offset=10;rows=10

On the second page, get that div again, and repeat until the last page.
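The extraction step can be sketched with BeautifulSoup. This runs against a simplified version of the pagination snippet above (the real href carries many more parameters); the class names and data-page-number attribute come from the answer's HTML.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Simplified stand-in for the pagination HTML found on the review-list page.
html = '''
<div class="bui-pagination__item">
<a href="/reviewlist.fr.html?pagename=apartamentos-levante-club;offset=10;rows=10"
   class="bui-pagination__link" data-page-number="2">2</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
# Locate the link for page 2 by its class and data-page-number attribute.
link = soup.find("a", class_="bui-pagination__link",
                 attrs={"data-page-number": "2"})
# Prepend the site root to the relative href to get the full page-2 URL.
next_url = urljoin("https://www.booking.com/", link["href"])
```

Fetching `next_url`, re-parsing, and repeating until no higher data-page-number is found walks every review page.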

Upvotes: 0
