Reputation: 133
At first, I wanted to go to this page for each hotel:
Unfortunately, some kind of JavaScript process opens this subpage, and my script doesn't realize it is there: even with the right URL, it assumes it is still on the main page:
Hence, I could not figure out how to scrape all the reviews through this subpage. So, with the help of one member, we found that the sub-page I want loads from this URL: subpagelink
I saw that I just needed to replace the word "hotel" in the main URL with "hotelfeaturedreviews", and then I could easily scrape the reviews:
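For example, with a hypothetical hotel link (replacing just the /hotel/ path segment, so that a hotel whose own name contains "hotel" is not mangled):

url = 'https://www.booking.com/hotel/fr/le-clery.fr.html'  # hypothetical link
print(url.replace('/hotel/', '/hotelfeaturedreviews/'))
# -> https://www.booking.com/hotelfeaturedreviews/fr/le-clery.fr.html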
So I made this script:
from selenium import webdriver
import time
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from selenium.webdriver.common.keys import Keys

PATH = r"driver\chromedriver.exe"  # raw string so the backslash is not read as an escape

options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument("--enable-logging")
driver = webdriver.Chrome(options=options, executable_path=PATH)

driver.get('https://www.booking.com/index.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ&sid=303509179a2849df63e4d1e5bc1ab1e3&srpvid=e6ae6d1417bd00a1&click_from_logo=1')
driver.maximize_window()
time.sleep(2)

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

# Accept the cookie banner if it shows up (the lookup itself can raise,
# so it belongs inside the try block).
try:
    cookie = driver.find_element_by_xpath('//*[@id="onetrust-accept-btn-handler"]')
    cookie.click()
except:
    pass
time.sleep(2)

# Type the city into the search box and launch the search.
job_title = driver.find_element_by_xpath('//*[@id="ss"]')
job_title.click()
job_title.send_keys('Paris')  # enter the city here, mind the spelling
time.sleep(3)

search = driver.find_element_by_xpath('//*[@id="frm"]/div[1]/div[4]/div[2]/button')
search.click()
time.sleep(6)

# Collect the hotel links over n result pages.
linksfinal = []
n = 1
for x in range(n):  # iterate over n pages
    time.sleep(3)
    my_elems = driver.find_elements_by_xpath('//a[@class="js-sr-hotel-link hotel_name_link url"]')
    links = [my_elem.get_attribute("href") for my_elem in my_elems]
    links = [link.replace('\n', '') for link in links]
    linksfinal = linksfinal + links
    time.sleep(3)
    next_button = driver.find_element_by_xpath('//*[@class="bk-icon -iconset-navarrow_right bui-pagination__icon"]')
    next_button.click()  # renamed from "next", which shadows the built-in

# Fetch each hotel page once to get its name.
nameshotel = []
for url in linksfinal:
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    name = soup.find("h2", attrs={"id": "hp_hotel_name"}).text.strip("\n").split("\n")[1]
    nameshotel.append(name)

# Point each link at the reviews subpage; replacing the "/hotel/" path
# segment (not the bare word) avoids mangling names that contain "hotel".
for i in range(len(linksfinal)):
    linksfinal[i] = linksfinal[i].replace('/hotel/', '/hotelfeaturedreviews/')

# Scrape the reviews of each hotel and write one CSV per hotel.
for url, name in zip(linksfinal, nameshotel):
    commspos = []
    commsneg = []
    header = []
    notes = []
    dates = []
    datestostay = []
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    reviews = soup.find_all('li', class_="review_item clearfix")
    for review in reviews:
        try:
            commpos = review.find("p", class_="review_pos").text.strip()
        except:
            commpos = 'NA'
        commspos.append(commpos)
        try:
            commneg = review.find("p", class_="review_neg").text.strip()
        except:
            commneg = 'NA'
        commsneg.append(commneg)
        head = review.find('div', class_='review_item_header_content').text.strip()
        header.append(head)
        note = review.find('span', class_='review-score-badge').text.strip()
        notes.append(note)
        date = review.find('p', class_='review_item_date').text[23:].strip()
        dates.append(date)
        try:
            datestay = review.find('p', class_='review_staydate').text[20:].strip()
            datestostay.append(datestay)
        except:
            datestostay.append('NaN')
    data = pd.DataFrame({
        'commspos': commspos,
        'commsneg': commsneg,
        'headers': header,
        'notes': notes,
        'dates': dates,
        'datestostay': datestostay,
    })
    data.to_csv(f"{name}.csv", sep=';', index=False, encoding='utf_8_sig')
    #data.to_csv(name + datetime.now().strftime("_%Y_%m_%d-%I_%M_%S") + ".csv", sep=';', index=False)  # needs: from datetime import datetime
    time.sleep(3)
This script scrapes the links of all the hotels I want, stores them in a list, replaces "hotel" with "hotelfeaturedreviews" in each one, and loops over all those links to scrape the reviews of each hotel.
Unfortunately, there is no next button on these pages, so I cannot scrape all the reviews either. At the end of the page there is nothing like in the subpage mentioned at the beginning, so I cannot figure out how to reach the next page of reviews with this trick.
I'm kind of lost. Do you have any idea how I could overcome this and scrape all the comments I want, with control over the pagination so I can go to whichever review page I want?
Sorry for the hassle; I wanted to be clear, so I included a lot of detail.
Upvotes: 0
Views: 529
Reputation: 4779
You can do it like this: build the review-list URL yourself with the right query parameters. After setting those parameters in the URL, you can request that final URL and extract whatever data you need.
E.g.:
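A minimal sketch of that idea. The reviewlist.fr.html endpoint and its pagename/rows/offset parameters are the ones visible in the pagination links on the review subpage; the pagename value below is hypothetical, and I'm assuming the review items keep the review_item clearfix class used in the question's script and that the endpoint accepts ordinary &-separated parameters:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

# 'pagename' identifies the hotel (hypothetical value here); 'rows' is the page
# size and 'offset' picks the page: offset=0 -> page 1, offset=10 -> page 2, ...
params = {
    'cc1': 'fr',
    'pagename': 'le-clery',  # hypothetical
    'rows': 10,
    'offset': 0,
}

for page in range(5):  # first five review pages
    params['offset'] = page * params['rows']
    res = requests.get('https://www.booking.com/reviewlist.fr.html', params=params, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    reviews = soup.find_all('li', class_='review_item clearfix')
    print(f"page {page + 1}: {len(reviews)} reviews")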
Upvotes: 1
Reputation: 21
Go to your subpagelink and you will find the div that contains the next page; for example, if you are on the first page, you can find page number 2:
<div class="bui-pagination__item">
  <a href="/reviewlist.fr.html?aid=304142;label=gen173nr-1DCAEoggI46AdIM1gEaEaIAQGYAQ24AQfIAQzYAQPoAQGIAgGoAgO4AsXYm4cGwAIB0gIkMDJjN2FmZTQtYTg4YS00NDI5LTlhMDYtNDdmM2IyZWE4Y2Q02AIE4AIB;sid=57b6bff3ca5e1cc8a54e98d7d17ed16a;cc1=es;dist=1;length_of_stay=25;pagename=apartamentos-levante-club;srpvid=a99e5664b05c00da;type=total&;offset=10;rows=10"
     class="bui-pagination__link"
     data-page-number="2"
  >
Get the href from the div and prepend the base URL https://www.booking.com to it.
With this you will get the second page link:
https://www.booking.com/reviewlist.fr.html?aid=304142;label=gen173nr-1DCAEoggI46AdIM1gEaEaIAQGYAQ24AQfIAQzYAQPoAQGIAgGoAgO4AsXYm4cGwAIB0gIkMDJjN2FmZTQtYTg4YS00NDI5LTlhMDYtNDdmM2IyZWE4Y2Q02AIE4AIB;sid=57b6bff3ca5e1cc8a54e98d7d17ed16a;cc1=es;dist=1;length_of_stay=25;pagename=apartamentos-levante-club;srpvid=a99e5664b05c00da;type=total&;offset=10;rows=10
On the second page, get the div again, and repeat until the last page.
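Roughly, the loop looks like this (the first-page URL below is a simplified, hypothetical version of the query string above, and the selectors come from the pagination snippet):

import requests
from bs4 import BeautifulSoup

BASE = 'https://www.booking.com'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}

# First review page (simplified/hypothetical query string, offset=0).
url = BASE + '/reviewlist.fr.html?cc1=es;pagename=apartamentos-levante-club;rows=10;offset=0'
page = 2  # number of the page we will look for next

while url:
    soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

    # ... scrape this page's reviews here ...

    # The pagination div holds one link per page; pick the one for the next page.
    nxt = soup.select_one(f'a.bui-pagination__link[data-page-number="{page}"]')
    url = BASE + nxt['href'] if nxt else None
    page += 1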
Upvotes: 0