Box

Reputation: 73

Scraping rating from Tripadvisor

I'm scraping the activities to do in Paris from TripAdvisor (https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html).

The code I've written works, but I still haven't found a way to get the rating of each activity. TripAdvisor displays the rating as five circles, and I need to know how many of them are filled in.

The "rating" field always comes back empty.

Here is the code:

import pprint
from selenium import webdriver

# chrome_options is configured earlier (not shown)
wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
wd.get("https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html")

detail_tours = []
# list_tours holds the activity card elements, collected earlier (not shown)
for tour in list_tours:
    url = tour.find_elements_by_css_selector("a")[0].get_attribute("href")
    title = ""
    reviews = ""
    rating = ""
    if len(tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")) > 0:
        title = tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")[0].text
    if len(tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")) > 0:
        reviews = tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")[0].text
    if len(tour.find_elements_by_css_selector(".zWXXYhVR")) > 0:
        # This is the part that always returns an empty string
        rating = tour.find_elements_by_css_selector(".zWXXYhVR")[0].text

    # Append inside the loop so that every activity is stored, not just the last one
    detail_tours.append({'url': url,
                         'title': title,
                         'reviews': reviews,
                         'rating': rating})

Upvotes: 1

Views: 185

Answers (1)

Phorys

Reputation: 379

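I would use BeautifulSoup, along the lines of the code below. The rating is not in the element's text: it sits in the aria-label attribute of the svg element, which is presumably why .text gives you nothing. (I would also recommend studying the structure of the HTML, but judging by your code I don't think you need that advice.)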

import re

import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}

resp = requests.get('https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html', headers=header)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')
    cards = soup.find_all('div', {'data-automation': 'cardWrapper'})
    for card in cards:
        # The rating lives in the aria-label of the svg, not in its text
        rating = card.find('svg', {'class': 'zWXXYhVR'})
        if rating is None:
            continue  # this card has no rating element
        match = re.match('Punteggio ([0-9,]+)', rating.get('aria-label', ''))
        if match:
            # The Italian site uses a decimal comma, so convert it to a dot
            print(float(match[1].replace(',', '.')))
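If you want to reuse the parsing step, it could be pulled out into a small helper along these lines. This is only a sketch: parse_rating is a name I made up, and the sample label text is an assumption based on the regex above, not something copied from the live page. It returns None instead of raising when the label is missing or doesn't match.

```python
import re
from typing import Optional

# Assumed label format, inferred from the regex in the answer: "Punteggio 4,5 ..."
_RATING_RE = re.compile(r"Punteggio ([0-9]+(?:,[0-9]+)?)")


def parse_rating(aria_label: Optional[str]) -> Optional[float]:
    """Turn an aria-label such as 'Punteggio 4,5 ...' into a float, or None."""
    if not aria_label:
        return None
    match = _RATING_RE.match(aria_label)
    if match is None:
        return None
    # Convert the Italian decimal comma into a dot before calling float()
    return float(match.group(1).replace(",", "."))


if __name__ == "__main__":
    # Hypothetical label text, used only for illustration
    print(parse_rating("Punteggio 4,5 su 5"))
    print(parse_rating(None))
```

With Selenium you should be able to feed it the same attribute, e.g. the value returned by get_attribute("aria-label") on the svg element, instead of .text.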

A small bonus: the part of the link prefixed with oa (oa60 in the example below) is the starting offset, which advances in steps of 30 results. So if you want to change pages, change your link to include oa30, oa60, oa90, etc.: https://www.tripadvisor.it/Attractions-g187147-Activities-c42-oa60-Paris_Ile_de_France.html
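As a rough sketch of how the offset could be used to build the page links: page_url and BASE below are names I introduced, and I'm assuming the first page is the link without any oa segment (as in the question) and that each page holds 30 results.

```python
# URL template; the {segment} slot is where the optional "oaNN-" part goes
BASE = (
    "https://www.tripadvisor.it/"
    "Attractions-g187147-Activities-c42-{segment}Paris_Ile_de_France.html"
)


def page_url(page: int, page_size: int = 30) -> str:
    """Build the listing URL for a zero-based page index."""
    if page < 0:
        raise ValueError("page must be non-negative")
    offset = page * page_size
    # Assumption: the first page has no "oa" segment at all
    segment = f"oa{offset}-" if offset else ""
    return BASE.format(segment=segment)


if __name__ == "__main__":
    for page in range(3):
        print(page_url(page))
```

Each URL can then be passed to requests.get in place of the hard-coded link.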

Upvotes: 1
