Reputation: 3
I am learning web scraping on my own and am trying to scrape reviewers' ratings on Yelp as practice. Typically, I can use CSS selectors or XPath to select the content I am interested in. However, those methods do not work for reviewers' ratings. For instance, on the following page: https://www.yelp.com/user_details_reviews_self?userid=0S6EI51ej5J7dgYz3-O0lA, the CSS selector for the first rating is '.stars_2'. However, if I use this selector in my RSelenium code as follows:
ratings=remDr$findElements('css selector','.stars_2')
ratings=unlist(lapply(ratings, function(x){x$getElementText()}))
I get NULL. I think the reason is that the rating is actually an image, so the element has no text to return. Here is a small part of the page source:
<div class="review-content">
<div class="review-content">
<div class="biz-rating biz-rating-very-large clearfix">
<div>
<div class="rating-very-large">
<i class="star-img stars_2" title="2.0 star rating">
<img alt="2.0 star rating" class="offscreen" height="303" src="//s3-media4.fl.yelpcdn.com/assets/srv0/yelp_styleguide/c2252a4cd43e/assets/img/stars/stars_map.png" width="84">
</i>
</div>
</div>
Basically, if I can extract the value of class="star-img stars_2" or title="2.0 star rating", I have what I need. Can anyone help me with this?
Upvotes: 0
Views: 2880
Reputation: 20341
What about using a regular expression on the page's HTML? Something like:
>>> import requests
>>> url = 'http://www.yelp.com/user_details_reviews_self?userid=0S6EI51ej5J7dgYz3-O0lA'
>>> html = requests.get(url).text
>>> import re
>>> rating_pattern = re.compile(r'\d\.\d star rating">')
>>> for rating in re.findall(rating_pattern, html):
...     print(rating)
...
2.0 star rating">
4.0 star rating">
5.0 star rating">
5.0 star rating">
5.0 star rating">
5.0 star rating">
5.0 star rating">
2.0 star rating">
4.0 star rating">
2.0 star rating">
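If only the number is wanted, a capturing group could be added to the pattern. The following is a minimal sketch rather than a tested scraper: `extract_ratings` is a hypothetical helper name, and the sample string is the `<i>` tag copied from the page source in the question, so it assumes the rating always appears in a `title` attribute of that form.

```python
import re

# Escaped dot plus a capturing group, so findall returns only the number.
RATING_PATTERN = re.compile(r'title="(\d\.\d) star rating"')


def extract_ratings(html):
    """Return every rating found in a title attribute, as floats."""
    return [float(match) for match in RATING_PATTERN.findall(html)]


# Fragment taken from the page source quoted in the question.
sample = '<i class="star-img stars_2" title="2.0 star rating">'
print(extract_ratings(sample))
```

In practice the `html` string from `requests.get(url).text` above would be passed in place of `sample`.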
Upvotes: 1
Reputation: 50
Would this work for you?
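source = driver.page_source  # page source of the current page (driver is a Selenium WebDriver)
images = source.split("<img")[1:]  # one chunk per <img> tag
for image in images:
    if "star rating" in image:
        # The alt text looks like "2.0 star rating"; keep the number before "star".
        rating = image.split('alt="')[1]
        rating = rating.split("star")[0]
        rating = float(rating)
        print(rating)
        break  # stop after the first rating; remove this line to print them all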
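String splitting can break if the attribute order changes, so an alternative is to let an HTML parser read the `title` attribute directly. This is only a sketch using the standard library's `html.parser`. `StarRatingParser` and `ratings_from_source` are hypothetical names, and it assumes the markup matches the fragment in the question (an `<i>` tag with the `star-img` class and a title such as "2.0 star rating").

```python
from html.parser import HTMLParser


class StarRatingParser(HTMLParser):
    """Collect ratings from the title of <i class="star-img ..."> tags."""

    def __init__(self):
        super().__init__()
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        title = attrs.get("title") or ""
        if tag == "i" and "star-img" in classes and title.endswith("star rating"):
            # The title looks like "2.0 star rating"; the first token is the number.
            self.ratings.append(float(title.split()[0]))


def ratings_from_source(source):
    """Parse an HTML string and return all star ratings as floats."""
    parser = StarRatingParser()
    parser.feed(source)
    return parser.ratings


# Fragment based on the page source quoted in the question.
sample = '''<div class="rating-very-large">
<i class="star-img stars_2" title="2.0 star rating">
<img alt="2.0 star rating" class="offscreen" height="303" width="84">
</i>
</div>'''
print(ratings_from_source(sample))
```

With Selenium, `driver.page_source` would be passed to `ratings_from_source` instead of `sample`.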
Upvotes: 0