tgile512

Reputation: 47

Glassdoor Web Scrape With Selenium

I am trying to scrape the rating trend data shown in the bottom-left chart on the page linked below, but I cannot figure out how to get at it. I suspect the chart is embedded as an image, which would make the data inaccessible, but I thought I would check.

I have added the code I stitched together, but it only returns the axis values.

Any help would be greatly appreciated.

https://www.glassdoor.com/Reviews/Netflix-Reviews-E11891.htm#trends-overallRating

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome without opening a browser window
options = Options()
options.add_argument('--headless')

# 'PATH' is the location of the chromedriver executable
driver = webdriver.Chrome(r'PATH', options=options)
driver.get('https://www.glassdoor.com/Reviews/Netflix-Reviews-E11891.htm#trends-overallRating')

# Grab the chart container and print its visible text (this only yields the axis values)
trend_element = driver.find_element(By.XPATH, '//*[@id="DesktopTrendChart"]')
trend = trend_element.text
print(trend)

driver.quit()

Upvotes: 1

Views: 1742

Answers (1)

chitown88

Reputation: 28565

I originally had a go at this using BeautifulSoup.

I managed to pull out the coordinates of all the plotted values. It took about an hour to find where everything was located, extract it, and get it into a nice, tidy dataframe.

For the next step, I was going to convert the x and y coordinates to the corresponding x-axis and y-axis labels, then interpolate to create a more granular set of data (I had not attempted this yet). I expected that to take about another hour.
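To illustrate that conversion step, here is a minimal sketch of mapping a pixel coordinate back to an axis value by linear interpolation between two known tick marks. The function name `pixel_to_value` and all pixel positions and tick values are made up for the example; they are not taken from the Glassdoor chart.

```python
def pixel_to_value(pixel, pixel_a, value_a, pixel_b, value_b):
    """Linearly map a pixel coordinate to an axis value.

    (pixel_a, value_a) and (pixel_b, value_b) are two reference ticks
    whose pixel position and axis label are both known.
    """
    if pixel_a == pixel_b:
        raise ValueError("reference ticks must sit at different pixel positions")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return value_a + (pixel - pixel_a) * scale


# Hypothetical y-axis: the tick labelled 3.0 sits at pixel 200 and the tick
# labelled 4.0 sits at pixel 100 (SVG y coordinates grow downwards).
rating = pixel_to_value(150, pixel_a=200, value_a=3.0, pixel_b=100, value_b=4.0)
print(rating)
```

The same helper would work for the x-axis if the dates are first turned into numbers (for example ordinal days), since it only relies on the axis being linear.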

I did a little more research prior to doing that and found an interesting article here.

After reading it and going back to the original problem, I was able to do this a) in fewer lines of code, b) without BeautifulSoup, and c) in about 5-10 minutes, and d) I learned something new.

So read over that link, check out the code, and this should get you what you need.

import requests
import pandas as pd

# JSON endpoint behind the rating trend chart (11891 matches the E11891 in the page URL)
url = 'https://www.glassdoor.co.uk/api/employer/11891-rating.htm?dataType=trend&category=overallRating&locationStr=&jobTitleStr=&filterCurrentEmployee=false'

with requests.Session() as se:
    # Send browser-like headers with the request
    se.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    })
    response = se.get(url)
    response.raise_for_status()  # stop here on an HTTP error status

# Decode the JSON body into a dict
data = response.json()

# 'dates' and 'employerRatings' are parallel lists: one rating per date
results = pd.DataFrame({'date': data['dates'], 'rating': data['employerRatings']})
Output:

print(results)
          date  rating
0   2018/12/30  3.66104
1   2018/12/30  3.66311
2   2018/11/25  3.69785
3   2018/10/28  3.73478
4    2018/9/30  3.68311
5    2018/8/26  3.69093
6    2018/7/29  3.70312
7    2018/6/24  3.74851
8    2018/5/27  3.67543
9    2018/4/29  3.67500
10   2018/3/25  3.62248
11   2018/2/25  3.73467
12   2018/1/28  3.70791
13  2017/12/31  3.72217
14  2017/11/26  3.69733
15  2017/10/29  3.61443
16   2017/9/24  3.47046
17   2017/8/27  3.46511
18   2017/7/30  3.46711
19   2017/6/25  3.48164
20   2017/5/28  3.52925
21   2017/4/30  3.46825
22   2017/3/26  3.46874
23   2017/2/26  3.52620
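
If you want the trend in chronological order with real date objects, a small post-processing step could look like the sketch below. It assumes the response has the same shape as above (parallel `dates` and `employerRatings` lists, with dates written as `YYYY/M/D`). The helper name `trend_rows` is hypothetical, and the sample payload is just three rows copied from the output above.

```python
from datetime import datetime


def trend_rows(data):
    """Pair each date with its rating and sort oldest first."""
    dates, ratings = data['dates'], data['employerRatings']
    if len(dates) != len(ratings):
        raise ValueError("dates and employerRatings differ in length")
    rows = [
        (datetime.strptime(d, '%Y/%m/%d').date(), float(r))
        for d, r in zip(dates, ratings)
    ]
    return sorted(rows, key=lambda row: row[0])


# Sample payload: three rows copied from the output above.
sample = {
    'dates': ['2018/11/25', '2018/10/28', '2018/9/30'],
    'employerRatings': [3.69785, 3.73478, 3.68311],
}

for day, rating in trend_rows(sample):
    print(day.isoformat(), rating)
```

The returned list of tuples can be handed to `pd.DataFrame(rows, columns=['date', 'rating'])` if you prefer to stay in pandas.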

Upvotes: 4
