Reputation: 535
This is the website:
url <- "https://projects.fivethirtyeight.com/2017-nba-predictions/"
As you can see in this question: Web scraping in R?, you can choose different dates and the table changes.
But my question is different: how can I extract the tables from different days?
I am only able to extract the table for the "Today" date.
I know that I will need to work with the element id="standings-table-wrapper" every time I change the date.
But how can I deal with that?
This is how I managed to extract the table for the "Today" date:
library(rvest)
library(magrittr)

page <- read_html('https://projects.fivethirtyeight.com/2017-nba-predictions/')

df <- setNames(data.frame(cbind(
  html_text(html_nodes(page, 'td.original')),
  html_text(html_nodes(page, 'td.carmelo')),
  html_text(html_nodes(page, '.change')),
  html_text(html_nodes(page, '.team a'))
)), c('elo', 'carmelo', '1wkchange', 'team'))

print(df)
Any help?
Upvotes: 2
Views: 574
Reputation: 84465
tl;dr;
The page relies on javascript running in the browser to handle the table updates when different dates are selected. This process cannot occur when making a request with rvest.
I will give a fairly language-agnostic description of what is going on and a potential solution, and will then show an implementation of that solution in Python. For R, I was unsure how to do the dataframe operations needed for the 1-week change calculation; I may one day update this to include a full R example too.
Some observations:
- Selecting a different date updates the table without any new network request being made, so the data for every date must already be available to the page after the initial load.
- The table is built by javascript at runtime, which is why it is absent from the static HTML returned by a simple request (e.g. with rvest).
The data source:
Based on these two observations a quick search of the source documents for the page soon reveals the javascript source and that the entire table is built dynamically.
The source link for this minified (compressed) js file is:
https://projects.fivethirtyeight.com/2017-nba-predictions/js/bundle.js?v=c1d7d294b039ddcc92b494602a5e337b
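If you would rather locate that link programmatically than read the page source by hand, a short sketch follows; it assumes the script tag references bundle.js via an ordinary src attribute, which is worth verifying against the live page:

import re
import requests

# Pull the page HTML and search it for the bundle.js reference.
html = requests.get('https://projects.fivethirtyeight.com/2017-nba-predictions/').text
print(re.findall(r'src="([^"]*bundle\.js[^"]*)"', html))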
Table construction and population:
Looking further into this file we can see the instructions for dynamically constructing and populating the table. Data population is handled by a series of functions which retrieve information from javascript objects within the file, together with helper functions which, for example, round the elo and carmelo output.
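In rough Python terms, that rounding behaviour amounts to the following (an approximation: Python's round() uses banker's rounding for exact .5 values, unlike js Math.round, though for this data it rarely matters):

def round_rating(value):
    # Round an elo/carmelo string such as '1761.49' to a whole number.
    return round(float(value))

print(round_rating('1761.49'))  # 1761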
Understanding how the data is stored:
The data we need is all within function 15 of the js file, in an array of arrays: an array of forecasts, each containing an array of team info. Zooming in on one date (i.e. a single item in the outer forecasts array), there is one such block per team for that particular date.
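To make the layout concrete, here is the approximate shape of that structure rewritten as a Python literal. The field names come from the js source (they are the same names the regexes below target); the values are invented placeholders:

# Approximate shape of the data inside function 15 of bundle.js.
forecasts = [
    {
        "last_updated": "2017-06-07T12:00:00Z",  # one timestamp per forecast date
        "teams": [                               # one entry per team for that date
            {"name": "GSW", "elo": 1700.0, "carmelo": 1750.0},
            {"name": "CLE", "elo": 1650.0, "carmelo": 1700.0},
            # ...
        ],
    },
    # ...one item per selectable date
]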
Retrieving the items and reconstructing a table with the columns of interest:
Sadly, the javascript notation in use here does not lend itself to easy parsing with a json library if one regexes out the array. In Python, at least, we have hjson, which can handle unquoted keys (the var names), but that still ends in an EOF error when trying to parse the info in this case (though it is possible I needed to alter my regex to terminate earlier - a musing for another time). But what we can do is the following:
1. Regex out the string housed by the forecasts array from the js file (get_data takes care of this).
2. Split that string into one block per date and, within each block, regex out lists for the columns of interest: date, elo, carmelo and team. These lists can then be joined together into a dataframe for that date. The date field has to be repeated to the length of one of the other columns (e.g. elo), as it only occurs once per block.
3. Add columns (or convert the elo and carmelo columns directly) for rounded elo and carmelo figures; you can replicate the original js implementation (rounding to whole numbers) or implement your own.
4. Map the team abbreviations to their full names; get_team_dict is used for this.
5. GroupBy on TEAM and Sort By on DATE desc to calculate the 1-WEEK CHANGE column, then sort by CARM-ELO desc within the given date period (as the page does).
Py implementation:
import requests, re
import pandas as pd

def get_data(n, response_text):
    # n is the function number within the js, e.g. 15: [function(s, _, n){}
    # left slightly abstract so you can examine various functions in the js block
    pattern = re.compile(rf',{n}:(\[.+),{n+1}', re.DOTALL)
    func_string = pattern.findall(response_text)[0]
    return func_string

def get_team_dict(response_text):
    # map of team abbreviation -> full name, pulled from the abbrToFull js function
    p_teams = re.compile(r'abbrToFull:function\(s\){return{(.*?)}')
    team_info = p_teams.findall(response_text)[0].replace('"', '')
    team_dict = dict(team.split(':') for team in team_info.split(','))
    return team_dict

def get_column(block, pattern, length=0):
    # pattern is one of p_block_elos, p_block_carmelos, p_block_teams, p_block_dates
    values = pattern.findall(block)
    if length: values = values * length
    return values

def get_df_from_processed_blocks(info_blocks, team_dict):
    final = pd.DataFrame()
    p_block_dates = re.compile(r'last_updated:"(.*?)T')
    p_block_teams = re.compile(r'name:"(.*?)"')
    p_block_elos = re.compile(r',elo:(.*?),')
    p_block_carmelos = re.compile(r'carmelo:(.*?),')
    for block in info_blocks:
        if block == info_blocks[0]: block = block.split('forecasts:')[-1]
        teams = get_column(block, p_block_teams)
        teams_fullnames = [team_dict[team] for team in teams]
        elos = get_column(block, p_block_elos)
        rounded_elos = [round(float(elo)) for elo in elos]  # generate rounded values similar to the js func
        carmelos = get_column(block, p_block_carmelos)
        rounded_carmelos = [round(float(carmelo)) for carmelo in carmelos]
        dates = get_column(block, p_block_dates, len(elos))  # repeat the block's single date to match the other columns for zipping
        df = pd.DataFrame(list(zip(dates, teams, teams_fullnames, elos, rounded_elos, carmelos, rounded_carmelos)))
        if final.empty:
            final = df
        else:
            final = pd.concat([final, df], ignore_index=True)
    return final

def get_date_sorted_df_with_change_column(final):
    grouped_df = final.groupby(['TEAM (Abbr)'])
    grouped_df = grouped_df.apply(lambda _df: _df.sort_values(by=['DATE'], ascending=False))
    # diff vs the next row, i.e. the previous forecast after the descending sort
    grouped_df['1-WEEK CHANGE'] = pd.to_numeric(grouped_df['CARM-ELO'], errors='coerce').fillna(0).astype(int).diff(periods=-1)
    # Any other desired changes to columns....
    return grouped_df

def write_csv(final, file_name):
    final.to_csv(f'C:/Users/User/Desktop/{file_name}.csv', sep=',', encoding='utf-8-sig', index=False, header=True)

def main():
    response_text = requests.get('https://projects.fivethirtyeight.com/2017-nba-predictions/js/bundle.js?v=c1d7d294b039ddcc92b494602a5e337b').text
    team_dict = get_team_dict(response_text)
    p_info_blocks = re.compile(r'last_updated:".+?Z",teams.+?\]', re.DOTALL)
    info_blocks = p_info_blocks.findall(get_data(15, response_text))
    final = get_df_from_processed_blocks(info_blocks, team_dict)
    headers = ['DATE', 'TEAM (Abbr)', 'TEAM (Full)', 'ELO', 'Rounded ELO', 'CARM-ELO', 'Rounded CARM-ELO']
    final.columns = headers
    grouped_df = get_date_sorted_df_with_change_column(final)
    write_csv(grouped_df, 'scores')

if __name__ == "__main__":
    main()
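One detail worth noting in get_date_sorted_df_with_change_column: diff(periods=-1) subtracts the next row from the current one, which after the descending date sort is the previous forecast. A toy demonstration with made-up numbers:

import pandas as pd

# Made-up CARM-ELO values for one team, already sorted by date descending.
s = pd.Series([1796, 1790, 1784], index=['2017-06-07', '2017-05-31', '2017-05-24'])

print(s.diff(periods=-1))
# 2017-06-07    6.0
# 2017-05-31    6.0
# 2017-05-24    NaN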
Understanding the regexes:
I would suggest pasting each regex pattern into an online regex engine and observing the explanation it provides; you can also test against the source js file in the browser or in an editor. An example, the regex that generates the blocks, is saved here. DISCLAIMER: I am not a regex expert. Feel free to suggest improvements.
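As a quick sanity check of the mechanics, the patterns can be run against an invented sample string that mimics the js notation (the sample below is made up, not taken from the real file):

import re

sample = 'last_updated:"2017-06-07T12:00:00Z",teams:[{name:"GSW",elo:1700.5,carmelo:1750.5}]'

p_info_blocks = re.compile(r'last_updated:".+?Z",teams.+?\]', re.DOTALL)
p_block_elos = re.compile(r',elo:(.*?),')

block = p_info_blocks.findall(sample)[0]
print(p_block_elos.findall(block))  # ['1700.5']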
Comparing the output:
The original answer compared a screenshot of the webpage against a screenshot of the script's output for the same date.
Upvotes: 6
Reputation: 1563
Try this:
library(rvest)
library(tidyverse)

page <- read_html('https://projects.fivethirtyeight.com/2017-nba-predictions/')

page %>%
  html_nodes("table") %>%
  .[[3]] %>%                                    # the standings table is the third <table> on the page
  html_table(., fill = TRUE, header = FALSE) %>%
  dplyr::select(-ncol(.)) %>%                   # drop the last column
  dplyr::slice(-c(1:3)) %>%                     # drop the first three header rows
  setNames(., .[1, ]) %>%                       # promote the next row to column names
  dplyr::slice(-1)                              # and remove it from the data
Upvotes: 2
Reputation: 494
I have performed a couple of web scraping tasks using RSelenium, so let me share working code for your use case. This is not a direct solution to your question, but rather one of the possible methods. Hope this helps!
library(RSelenium)
library(rvest)

# Create a Selenium driver instance; change the chrome version to your installed instance
rd <- rsDriver(browser = c("chrome"), chromever = "76.0.3809.126", port = 9515L)

# Assign the client to remDr. This will be the Chrome window used for automation
remDr <- rd$client
remDr$navigate("https://projects.fivethirtyeight.com/2017-nba-predictions/")

# Get the options present under the 'Forecast from' dropdown
FDatesRaw <- remDr$findElement(using = "xpath", value = "//div[@id='forecast-selector']")$getElementText()[[1]]

# Simplify the list
FDatesParsed <- unlist(strsplit(FDatesRaw, "\n"))

# Find the index of the option based on the date. Replace 'June 7' with any option from FDatesParsed as you need
option <- match("June 7", FDatesParsed) - 1

# Generate the xpath to be passed with the selected index
query <- paste0("//div[@id='forecast-selector']//option[@value = '", option, "']")

# Send the selected value to the browser
remDr$findElement(using = 'xpath', value = query)$clickElement()

# Get the page source as html
page <- read_html(remDr$getPageSource()[[1]])

# Followed by your code
df <- setNames(data.frame(cbind(
  html_text(html_nodes(page, 'td.original')),
  html_text(html_nodes(page, 'td.carmelo')),
  html_text(html_nodes(page, '.change')),
  html_text(html_nodes(page, '.team a'))
)), c('elo', 'carmelo', '1wkchange', 'team'))

# Close the browser, stop the server and collect garbage
remDr$close()
rd$server$stop()
gc()
Upvotes: -1