Vladimir Vargas

Reputation: 1824

Scraping Booking.com comments with Python

I am trying to get the titles of Booking.com comments from this website:

https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75,

where r_lang=all basically says that the website should show comments in every language.

In order to obtain the titles from this page I do this:

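from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75"
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
reviews = soup.findAll("li", {"class": "review_item clearfix "})

for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)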

From the website (see screenshot), the first two titles should be "Sencillamente placentera" and "It could have been great." However, the script somehow only gets comments in Spanish:

“Sencillamente placentera”

“La atención de la chica del restaurante”

“El desayuno estilo buffet, completo ”

“Me gusto la ubicación, y la vista.”

“Su ubicación es muy buena.”

I noticed that if I change 'museo.es.' to 'museo.en.' in the URL, I get the headers of English comments instead. This is inconsistent, because when I load the original URL in a browser, I get comments in English, French, Spanish, etc. How can I fix this? Thanks.

enter image description here

Upvotes: 5

Views: 2108

Answers (3)

Granitosaurus

Reputation: 21436

A newer way to access Booking.com reviews is the reviewlist.html endpoint. For example, for the hotel in the original question, the reviews are located at:

https://www.booking.com/reviewlist.html?pagename=ibis-bogota-museo&type=total&lang=en-us&sort=f_recent_desc&cc1=co&dist=1&rows=25&offset=0

This endpoint is particularly great because it supports many filters and offers up to 25 reviews per page.
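As a minimal sketch of how the rows and offset parameters page through the results, using only the standard library: the helper name review_page_url is hypothetical, the parameter names and values are copied from the URL above, and treating page numbers as 1-based is an assumption made here for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.booking.com/reviewlist.html"
MAX_ROWS = 25  # largest page size mentioned above


def review_page_url(pagename: str, cc1: str, page: int, rows: int = MAX_ROWS) -> str:
    """Build the reviewlist.html URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {
        "pagename": pagename,        # hotel slug, e.g. "ibis-bogota-museo"
        "type": "total",
        "lang": "en-us",             # language preference
        "sort": "f_recent_desc",     # most recent reviews first
        "cc1": cc1,                  # hotel country code, e.g. "co"
        "dist": 1,
        "rows": rows,
        "offset": (page - 1) * rows,  # page 1 starts at offset 0
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Print the URLs of the first three pages without sending any request.
    for page in (1, 2, 3):
        print(review_page_url("ibis-bogota-museo", "co", page))
```

With these defaults, page 1 reproduces the example URL above, and each later page only changes the offset by a multiple of rows.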

Here's a snippet in Python with parsel and httpx:

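import asyncio
from typing import List
from urllib.parse import urlencode

from parsel import Selector


def parse_reviews(html: str) -> List[dict]: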
    """parse review page for review data """
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed


async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""
    async def scrape_page(page, page_size=25):  # 25 is largest possible page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",  # this varies by hotel country, e.g. in OP's case it would be "co" for Colombia.
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                # pages are numbered from 1, so page 1 must map to offset 0
                "offset": (page - 1) * page_size,
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    total_pages = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    # default=1 avoids a ValueError from max() in case no pagination links are found
    total_pages = max((int(page) for page in total_pages), default=1)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])

    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results

I write more about scraping this endpoint in my blog post "How to Scrape Booking.com", which has more illustrations and videos if more information is needed.

Upvotes: 1

QHarr

Reputation: 84465

You could always use a browser as a plan B. Selenium doesn't have this problem:

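from selenium import webdriver
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
d.get('https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75')
# find_elements_by_css_selector was removed in Selenium 4; find_elements(By.CSS_SELECTOR, ...) works in 3 and 4
titles = [item.text for item in d.find_elements(By.CSS_SELECTOR, '.review_item_review_header [itemprop=name]')]
print(titles)
d.quit()  # close the browser when done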

Upvotes: 3

Bitto

Reputation: 8225

Servers can be configured to send different responses based on the browser making the request. Adding a User-Agent header seems to fix the problem:

import urllib.request
from bs4 import BeautifulSoup
url='https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75'
req = urllib.request.Request(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36',
    }
)

f = urllib.request.urlopen(req)
soup = BeautifulSoup(f.read().decode('utf-8'),'html.parser')
reviews = soup.findAll("li", {"class": "review_item clearfix "})
for review in reviews:
    print(review.find("div", {"class": "review_item_header_content"}).text)

Output:

“Sencillamente placentera”


“It could had been great.”


“will never stay their in the future.”


“Hôtel bien situé.”
...
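
To see what the header change amounts to, it may help to inspect the request before sending it. When no User-Agent is set, urllib identifies itself as Python-urllib/x.y, which a server can tell apart from a browser. The sketch below sends nothing over the network; the helper names default_user_agent and browser_request are hypothetical, and whether Booking.com actually keys its response on this header is an assumption based on the result above.

```python
import urllib.request

URL = "https://www.booking.com/reviews/co/hotel/ibis-bogota-museo.es.html?page=1;r_lang=all;rows=75"
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
)


def default_user_agent() -> str:
    """Return the User-Agent urllib sends when the caller sets none."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request
    return dict(opener.addheaders)["User-agent"]


def browser_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Build (but do not send) a request that identifies itself as a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    print("urllib default:", default_user_agent())
    req = browser_request(URL)
    # Request stores header names capitalized as "User-agent"
    print("overridden:    ", req.get_header("User-agent"))
```

Passing the Request returned by browser_request to urllib.request.urlopen is then equivalent to the code above.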

Upvotes: 4
