Reputation: 1824
I am trying to get the titles of Booking.com comments from this website:
https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75,
where r_lang=all
basically says that the website should show comments in every language.
In order to obtain the titles from this page I do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
page = urlopen(url)
soup = BeautifulSoup(page)
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
print(review.find("div", {"class": "review_item_header_content"}).text)
From the website (see screenshot), the first two titles should be "Sencillamente placentera" and "It could have been great.". However, somehow the url only loads comments in spanish: “Sencillamente placentera”
“La atención de la chica del restaurante”
“El desayuno estilo buffet, completo ”
“Me gusto la ubicación, y la vista.”
“Su ubicación es muy buena.”
I noticed that if in the url I change the 'museo.es.' to 'museo.en.', I get the headers of english comments. But this is inconsistent, because if I load the original url, I get comments in english, french, spanish, etc. How can I fix this? Thanks
Upvotes: 5
Views: 2108
Reputation: 21436
New way to access Booking.com reviews is to use the new reviewlist.html
endpoint. For example for hotel in original question reviews are located over at:
This endpoint is particularly great because it supports many filters and offers up to 25 reviews per page.
Here's a snippet in Python with parsel
and httpx
:
def parse_reviews(html: str) -> List[dict]:
"""parse review page for review data """
sel = Selector(text=html)
parsed = []
for review_box in sel.css('.review_list_new_item_block'):
get_css = lambda css: review_box.css(css).get("").strip()
parsed.append({
"id": review_box.xpath('@data-review-url').get(),
"score": get_css('.bui-review-score__badge::text'),
"title": get_css('.c-review-block__title::text'),
"date": get_css('.c-review-block__date::text'),
"user_name": get_css('.bui-avatar-block__title::text'),
"user_country": get_css('.bui-avatar-block__subtitle::text'),
"text": ''.join(review_box.css('.c-review__body ::text').getall()),
"lang": review_box.css('.c-review__body::attr(lang)').get(),
})
return parsed
async def scrape_reviews(hotel_id: str, session) -> List[dict]:
"""scrape all reviews of a hotel"""
async def scrape_page(page, page_size=25): # 25 is largest possible page size for this endpoint
url = "https://www.booking.com/reviewlist.html?" + urlencode(
{
"type": "total",
# we can configure language preference
"lang": "en-us",
# we can configure sorting order here, in this case recent reviews are first
"sort": "f_recent_desc",
"cc1": "gb", # this varies by hotel country, e.g in OP's case it would be "co" for columbia.
"dist": 1,
"pagename": hotel_id,
"rows": page_size,
"offset": page * page_size,
}
)
return await session.get(url)
first_page = await scrape_page(1)
total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
total_pages = max(int(page) for page in total_pages)
other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])
results = []
for response in [first_page, *other_pages]:
results.extend(parse_reviews(response.text))
return results
I write more about scraping this endpoint on my blog How to Scrape Booking.com which has more illustrations and videos if more information is needed.
Upvotes: 1
Reputation: 84465
You could always use a browser as a plan B. Selenium doesn't have this problem
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75')
titles = [item.text for item in d.find_elements_by_css_selector('.review_item_review_header [itemprop=name]')]
print(titles)
Upvotes: 3
Reputation: 8225
Servers can be configured to send different responses based on the browser making the request. Adding a User-Agent
seems to fix the problem.
import urllib.request
from bs4 import BeautifulSoup
url='https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75'
req = urllib.request.Request(
url,
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
}
)
f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'),'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
print(review.find("div", {"class": "review_item_header_content"}).text)
Output:
“Sencillamente placentera”
“It could had been great.”
“will never stay their in the future.”
“Hôtel bien situé.”
...
Upvotes: 4