Reputation: 47
I am trying to scrape the rating trend data shown in the bottom-left chart of the page linked below, but I cannot figure out how to get at it. I suspect the chart is embedded as an image, which would make the data inaccessible, but I thought I would check.
I have added the code I stitched together, but it only returns the axis values.
Any help would be greatly appreciated.
https://www.glassdoor.com/Reviews/Netflix-Reviews-E11891.htm#trends-overallRating
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep
import pandas as pd
from selenium.webdriver.common import action_chains, keys
from selenium.common.exceptions import NoSuchElementException
import numpy as np
import sys
import re
import warnings
options = Options()
options.headless = True
driver = webdriver.Chrome(r'PATH',options=options)
driver.get('https://www.glassdoor.com/Reviews/Netflix-Reviews-E11891.htm#trends-overallRating')
trend_element = driver.find_elements_by_xpath('//*[@id="DesktopTrendChart"]')[0]
trend = trend_element.text
print(trend)
Upvotes: 1
Views: 1742
Reputation: 28565
I originally had a go at this using BeautifulSoup.
I managed to pull out the coordinates of all the plotted values. It took about an hour to find where they were located, extract them, and get them into a nice, tidy dataframe.
The next step was going to be converting the x and y coordinates to the corresponding x and y axis labels, then interpolating to create a more granular set of data (which I had not attempted yet). I expected that to take about another hour.
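For anyone who does want to go the coordinate route, the conversion step is just a linear mapping between two known axis ticks. Here is a minimal sketch of that idea; the function name, the tick positions, and the pixel values are all made up for illustration and are not taken from the Glassdoor chart.

```python
def pixel_to_value(pixel, pixel_a, value_a, pixel_b, value_b):
    """Linearly map a pixel coordinate to an axis value using two known ticks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return value_a + (pixel - pixel_a) * scale


# Hypothetical y-axis: rating 1.0 drawn at pixel 200, rating 5.0 at pixel 0.
# This assumes y coordinates grow downward, as they usually do in SVG charts.
y_pixels = [80.0, 75.0, 66.5]
ratings = [pixel_to_value(p, 200.0, 1.0, 0.0, 5.0) for p in y_pixels]
print(ratings)
```

The same function works for the x axis once the dates are turned into numbers (for example, days since the first tick).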
Before doing that, I did a little more research and found an interesting article here.
After reading it and going back to the original problem, I was able to do this a) in fewer lines of code, b) without BeautifulSoup, and c) in about 5-10 minutes, and d) I learned something new.
So read over that link, check out the code, and this should get you what you need.
import requests
import json
import pandas as pd

# Endpoint that returns the chart's trend data as JSON
url = 'https://www.glassdoor.co.uk/api/employer/11891-rating.htm?dataType=trend&category=overallRating&locationStr=&jobTitleStr=&filterCurrentEmployee=false'

with requests.Session() as se:
    # Send browser-like headers with the request
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    response = se.get(url)

data = json.loads(response.text)

# One row per date with the matching overall rating
results = pd.DataFrame()
results['date'], results['rating'] = data['dates'], data['employerRatings']
Output:
print(results)
          date   rating
0   2018/12/30  3.66104
1   2018/12/30  3.66311
2   2018/11/25  3.69785
3   2018/10/28  3.73478
4    2018/9/30  3.68311
5    2018/8/26  3.69093
6    2018/7/29  3.70312
7    2018/6/24  3.74851
8    2018/5/27  3.67543
9    2018/4/29  3.67500
10   2018/3/25  3.62248
11   2018/2/25  3.73467
12   2018/1/28  3.70791
13  2017/12/31  3.72217
14  2017/11/26  3.69733
15  2017/10/29  3.61443
16   2017/9/24  3.47046
17   2017/8/27  3.46511
18   2017/7/30  3.46711
19   2017/6/25  3.48164
20   2017/5/28  3.52925
21   2017/4/30  3.46825
22   2017/3/26  3.46874
23   2017/2/26  3.52620
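The dates come back as plain strings, newest first. If you want a proper time series, something like the following should work. It is only a sketch: the small frame below is a handful of rows copied from the output above, standing in for the full `results` dataframe, and the name `trend` is my own.

```python
import pandas as pd

# A few rows copied from the output above, standing in for the full frame.
results = pd.DataFrame({
    "date": ["2018/12/30", "2018/11/25", "2018/9/30", "2017/2/26"],
    "rating": [3.66311, 3.69785, 3.68311, 3.52620],
})

# Convert the 'YYYY/M/D' strings to real timestamps.
results["date"] = pd.to_datetime(results["date"], format="%Y/%m/%d")

# Oldest first, with the date as the index so the series is easy to plot.
trend = results.sort_values("date").set_index("date")["rating"]
print(trend)
```

From there, `trend.plot()` gives a rough version of the chart, assuming matplotlib is installed.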
Upvotes: 4