Laura

Reputation: 535

WebScraping in R: dealing with tabs on WebSites

This is my website:

url <- "https://projects.fivethirtyeight.com/2017-nba-predictions/"

As you can see in this related question: Web scraping in R?

You can choose different dates and then your table changes.

But my question is different: How can I extract tables from different days?

I am only able to extract the table for the "Today" date.

I know that I will need to work with the id="standings-table-wrapper" element every time I change the date.

But how can I deal with that?

This is how I managed to extract the table regarding the "Today" date:

library(rvest)
library(magrittr)

page <- read_html('https://projects.fivethirtyeight.com/2017-nba-predictions/')
df <- setNames(data.frame(cbind(
  html_text(html_nodes(page, 'td.original')),
  html_text(html_nodes(page, 'td.carmelo')),
  html_text(html_nodes(page, '.change')),
  html_text(html_nodes(page, '.team a'))
)),c('elo','carmelo','1wkchange','team'))

print(df)

Any help?

Upvotes: 2

Views: 574

Answers (3)

QHarr

Reputation: 84465

tl;dr;

The page relies on javascript running in the browser to handle the table updates when selecting different dates. This process cannot occur when making a request with rvest.

I will give a fairly language-agnostic description of what is going on and a potential solution, then show an implementation of that solution in Python. For R, I was unsure how to do the dataframe operations needed to manage the 1-week change calculation, so for now a rough R sketch of the same pipeline follows the Python code below.


Some observations:

  • Observation: If you disable javascript in the browser and reload the page, you will see that the table no longer updates.
    • Conclusion: The table updates for different dates are driven by javascript running on the page (and javascript does not run when you make a request with rvest).
  • Observation: If you monitor the web traffic via the F12 dev tools whilst making different date selections, no additional traffic is generated by those selections.
    • Conclusion: All the data required for updating the table, and the javascript driving it, is present on page load (a quick check from R follows).
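
As a quick check of those conclusions from R, you can confirm what rvest actually receives: the date options are present in the static HTML, but selecting one does nothing without javascript. A minimal sketch (the forecast-selector id is borrowed from the RSelenium answer further down):

library(rvest)

page <- read_html('https://projects.fivethirtyeight.com/2017-nba-predictions/')

# The dropdown options are baked into the static HTML rvest receives...
html_text(html_nodes(page, '#forecast-selector option'))

# ...but the table itself is only ever rendered for the current date,
# because the redraw happens in browser-side javascript.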

The data source:

Based on these two observations, a quick search of the source documents for the page soon reveals the javascript source and shows that the entire table is built dynamically.

[screenshot: search of the page source showing the table is built dynamically by javascript]

The source link for this minified (compressed) js file is:

https://projects.fivethirtyeight.com/2017-nba-predictions/js/bundle.js?v=c1d7d294b039ddcc92b494602a5e337b
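
You can pull that file directly and confirm the forecast data is embedded in it. A quick sanity check in R (a sketch, assuming the httr package):

library(httr)

js <- content(GET('https://projects.fivethirtyeight.com/2017-nba-predictions/js/bundle.js?v=c1d7d294b039ddcc92b494602a5e337b'),
              as = 'text', encoding = 'UTF-8')

grepl('forecasts:', js)                      # TRUE: the forecasts array is in the file
length(gregexpr('last_updated:"', js)[[1]])  # roughly one match per date block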


Table construction and population:

Looking further into this file we can see the instructions for dynamically constructing and populating the table:

[screenshot: minified js that constructs and populates the table]

Data population is handled by a series of functions which retrieve information from javascript objects within the file, together with helper functions which, for example, round the elo and carmelo output:

[screenshot: js helper functions that round the elo and carmelo values]


Understanding how the data is stored:

The data we need is all within function 15, in an array of arrays:

[screenshot: function 15 in the minified js, containing the forecasts array of arrays]

So: an array of forecasts, each containing an array of team info.

If we were to zoom in on one date (i.e. a single item in the outer forecasts array) we would see:

[screenshot: a single forecast entry, expanded to show one date's team blocks]

If you look at the right-hand side, you can see each of the blocks associated with a different team for that particular date.
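
In R terms, each entry in the forecasts array is roughly shaped like the list below. The field names (last_updated, teams, name, elo, carmelo) come from the regexes used later in this answer; the values are invented for illustration:

# Hypothetical shape of a single forecast entry (values invented)
forecast <- list(
  last_updated = '2017-06-14T00:00:00Z',
  teams = list(
    list(name = 'GSW', elo = 1761.5, carmelo = 1833.2),
    list(name = 'CLE', elo = 1694.1, carmelo = 1702.8)
    # ...one block per team for this date
  )
)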


Retrieving the items and re-constructing a table with the columns of interest:

Sadly, the javascript notation in use here does not lend itself to easy parsing with a json library once you regex out the array. In Python, at least, we have hjson, which can handle unquoted keys (the var names), but that still ends in an EOF error when trying to parse the info in this case (though it is possible I needed to alter my regex to terminate earlier; a musing for another time). But what we can do is the following:

  • Make a request for that file and use its contents as a data source
  • Grab the string associated with function 15
  • Use regex to generate blocks by date (i.e. the array of items grouped by date that the javascript var forecasts houses):

[screenshot: regex matches, one block per date]

  • Loop those blocks and extract lists of matches (again by regex) for date, elo, carmelo and team. These lists can then be joined together into a dataframe for that date. The date field has to be repeated to the length of one of the other columns (e.g. elo), as it only occurs once per block.
  • Add additional columns for rounded elo and carmelo figures (or perform the rounding on the elo and carmelo columns directly); you can replicate the original js implementation shown in the image further up this answer, or implement your own.
  • Optionally add in a column so you have both abbreviated and long names for team. In my Python implementation I use a helper function get_team_dict for this.
  • You then need to generate the 1-week change values, which means taking the final dataframe, grouping by TEAM and sorting by DATE descending.
  • Perform a diff calculation within each group between the current and next rows, and generate an output column from the result.
  • Potentially sort the grouped object on carmelo descending within each date period (as the page does).
  • Do something with result e.g. write out to CSV.

Python implementation:

import requests, re
import pandas as pd

def get_data(n, response_text): 
    # n is function number within js e.g.  15: [function(s, _, n){}  
    # left slightly abstract so can examine various functions in particular js block    
    pattern = re.compile(rf',{n}:(\[.+),{n+1}', re.DOTALL)  # raw f-string so \[ is a literal bracket
    func_string = pattern.findall(response_text)[0]
    return func_string

def get_team_dict(response_text):
    p_teams = re.compile(r'abbrToFull:function\(s\){return{(.*?)}')
    team_info = p_teams.findall(response_text)[0].replace('"','')
    team_dict = dict(team.split(':') for team in team_info.split(','))
    return team_dict

def get_column(block, pattern, length = 0):  #p_block_elos, p_block_carmelos, p_block_teams, p_block_date
    values = pattern.findall(block)
    if length: values = values * length
    return values

def get_df_from_processed_blocks(info_blocks, team_dict):    
    final = pd.DataFrame()
    
    p_block_dates = re.compile(r'last_updated:"(.*?)T')
    p_block_teams = re.compile(r'name:"(.*?)"')
    p_block_elos = re.compile(r',elo:(.*?),')
    p_block_carmelos = re.compile(r'carmelo:(.*?),')

    for block in info_blocks:

        if block == info_blocks[0]: block = block.split('forecasts:')[-1]

        teams = get_column(block, p_block_teams)
        teams_fullnames = [team_dict[team] for team in teams]

        elos = get_column(block, p_block_elos)
        rounded_elos = [round(float(elo)) for elo in elos] # generate rounded values similar to the js func

        carmelos = get_column(block, p_block_carmelos)
        rounded_carmelos = [round(float(carmelo)) for carmelo in carmelos]

        dates = get_column(block, p_block_dates, len(elos)) # determine length of `elos` so can extend single date in block to match length for zipping lists for output
        df = pd.DataFrame(list(zip(dates, teams, teams_fullnames, elos, rounded_elos, carmelos, rounded_carmelos)))

        if final.empty:
            final = df
        else: 
            final = pd.concat([final, df], ignore_index = True)

    return final

def get_date_sorted_df_with_change_column(final):
    grouped_df = final.groupby(['TEAM (Abbr)'])
    grouped_df = grouped_df.apply(lambda _df: _df.sort_values(by=['DATE'], ascending=False ))
    grouped_df['1-WEEK CHANGE'] = pd.to_numeric(grouped_df['CARM-ELO'], errors='coerce').fillna(0).astype(int).diff(periods=-1)
    # Any other desired changes to columns....
    return grouped_df

def write_csv(final, file_name): 
    final.to_csv(f'C:/Users/User/Desktop/{file_name}.csv', sep=',', encoding='utf-8-sig',index = False, header = True)

def main():   
    
    response_text = requests.get('https://projects.fivethirtyeight.com/2017-nba-predictions/js/bundle.js?v=c1d7d294b039ddcc92b494602a5e337b').text   
    
    team_dict = get_team_dict(response_text)
    
    p_info_blocks = re.compile(r'last_updated:".+?Z",teams.+?\]', re.DOTALL)
    info_blocks = p_info_blocks.findall(get_data(15,response_text))    
    final = get_df_from_processed_blocks(info_blocks, team_dict)
    headers = ['DATE', 'TEAM (Abbr)', 'TEAM (Full)', 'ELO', 'Rounded ELO', 'CARM-ELO', 'Rounded CARM-ELO']
    final.columns = headers
    
    grouped_df = get_date_sorted_df_with_change_column(final)
    
    write_csv(grouped_df, 'scores')
    
if __name__ == "__main__":
    main()
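
R implementation (sketch):

For completeness, here is a rough R translation of the same pipeline. Treat it as a sketch rather than a battle-tested implementation: it reuses the regexes above, assumes the httr, stringr and dplyr packages, and applies the block regex to the whole js file instead of isolating function 15 first.

library(httr)
library(stringr)
library(dplyr)

js <- content(GET('https://projects.fivethirtyeight.com/2017-nba-predictions/js/bundle.js?v=c1d7d294b039ddcc92b494602a5e337b'),
              as = 'text', encoding = 'UTF-8')

# One block per date, mirroring p_info_blocks in the Python version
blocks <- str_extract_all(js, regex('last_updated:".+?Z",teams.+?\\]', dotall = TRUE))[[1]]

# Build one small data frame per block, then stack them
final <- bind_rows(lapply(blocks, function(block) {
  data.frame(
    date    = str_match(block, 'last_updated:"(.*?)T')[, 2],
    team    = str_match_all(block, 'name:"(.*?)"')[[1]][, 2],
    elo     = as.numeric(str_match_all(block, ',elo:(.*?),')[[1]][, 2]),
    carmelo = as.numeric(str_match_all(block, 'carmelo:(.*?),')[[1]][, 2])
  )
}))

# 1-week change: within each team, sort by date descending and diff
# against the next (older) row, as in the pandas version
final <- final %>%
  group_by(team) %>%
  arrange(desc(date), .by_group = TRUE) %>%
  mutate(change = carmelo - lead(carmelo)) %>%
  ungroup()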

Understanding the regexes:

I would suggest pasting the regex pattern into an online regex engine and observing the description. Perhaps test against the source js file in the browser or in an editor. As an example, the regex that generates the blocks is saved here.

You should then be provided with some sort of explanation. DISCLAIMER: I am not a regex expert. Feel free to suggest improvements.

[screenshot: regex explanation from the online regex engine]


Comparing the output:

Here is an example comparison of the webpage and the output.

Webpage

[screenshot: the table as rendered on the webpage]

Output:

[screenshot: the generated CSV output]


Reading:

  1. Regex

Upvotes: 6

Mislav

Reputation: 1563

Try this; it grabs the third table on the page and cleans up the header rows:

library(rvest)
library(tidyverse)

page <- read_html('https://projects.fivethirtyeight.com/2017-nba-predictions/')
page %>%
  html_nodes("table") %>%
  .[[3]] %>%
  html_table(., fill = TRUE, header = FALSE) %>%
  dplyr::select(-ncol(.)) %>%
  dplyr::slice(-c(1:3)) %>%
  setNames(., .[1, ]) %>%
  dplyr::slice(-1)

Upvotes: 2

FAlonso

Reputation: 494

I have performed a couple of web scraping tasks using RSelenium, so let me share working code for your use case. This is not a direct solution to your question, rather one possible method. Hope this helps!

library(RSelenium)
library(rvest)

# Create a Selenium driver instance; change chromever to match your installed Chrome version

rd <- rsDriver(browser = c("chrome"),chromever = "76.0.3809.126", port = 9515L)

# Assign the client to remDr variable. This will be the Chrome window to be used for automation

remDr <- rd$client

remDr$navigate("https://projects.fivethirtyeight.com/2017-nba-predictions/")

# Get the options present under 'Forecast from' dropdown
FDatesRaw <- remDr$findElement(using = "xpath", value = "//div[@id='forecast-selector']")$getElementText()[[1]]

# Simplify the list
FDatesParsed <- unlist(strsplit(FDatesRaw,"\n",))

# Find the index of the option based on the date. Replace 'June 7' with any option from FDatesParsed as you need
option <- match("June 7",FDatesParsed)-1

# Generate xpath to be passed with the selected index
query <-  paste0("//div[@id='forecast-selector']//option[@value = '",option,"']")

# Send the selected value to browser
remDr$findElement(using = 'xpath', value = query)$clickElement()

# Get the page source as html
page <- read_html(remDr$getPageSource()[[1]])

# Followed by your code

df <- setNames(data.frame(cbind(
    html_text(html_nodes(page, 'td.original')),
    html_text(html_nodes(page, 'td.carmelo')),
    html_text(html_nodes(page, '.change')),
    html_text(html_nodes(page, '.team a'))
)),c('elo','carmelo','1wkchange','team'))

# Close the browser, stop the server and collect garbage
remDr$close()
rd$server$stop()
gc() 
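
If you want every date rather than a single one, you can continue the same session and loop over the dropdown options, stacking the results. A sketch along the same lines (untested against the live page; the first entry of FDatesParsed may be the dropdown label rather than a date, so adjust the skip if needed):

all_days <- list()

for (i in seq_along(FDatesParsed)[-1]) {
  # Option values are the position in FDatesParsed minus one, as above
  query <- paste0("//div[@id='forecast-selector']//option[@value = '", i - 1, "']")
  remDr$findElement(using = 'xpath', value = query)$clickElement()
  Sys.sleep(1)  # give the page a moment to redraw the table

  page <- read_html(remDr$getPageSource()[[1]])
  df <- setNames(data.frame(cbind(
    html_text(html_nodes(page, 'td.original')),
    html_text(html_nodes(page, 'td.carmelo')),
    html_text(html_nodes(page, '.change')),
    html_text(html_nodes(page, '.team a'))
  )), c('elo', 'carmelo', '1wkchange', 'team'))
  df$date <- FDatesParsed[i]

  all_days[[length(all_days) + 1]] <- df
}

final <- do.call(rbind, all_days)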

Upvotes: -1
