Extracting values from js source or HTML tags with R?

Question

I'm trying to build a pipeline that loads every player who has played in the NBA, together with their unique player IDs (as shown in the image below), into my SQL database, using this webpage as the source.

[Image: How the IDs Manifest Themselves]

I was able to do this in Python (writing a CSV instead), but only by manually copying a variable out of the stats_ptsd.js file, which I found in the network responses when I inspected the page. I'm not showing that Python code because it doesn't scrape the page; it just reads from the manually copied list.

[Image: Network Responses]

[Image: How the CSV Looks]

Now I'm not sure how to scrape the same information with R. I've tried many methods I've seen online, most of them using the rvest package, but none produced any meaningful output or error message that I could show here. I'm hoping someone can suggest the best way to do this, whether by accessing the .js file or by scraping the HTML elements. The XPath for a player's 'a' HTML element with the valid href is shown below.

//*[contains(concat( " ", @class, " " ), concat( " ", "players-list__name", " " )) and (((count(preceding-sibling::*) + 1) = 91) and parent::*)]//a

QHarr · Accepted Answer

The data comes from a JS file that you can find in the network tab. You can use a regex or substring to extract the JavaScript object assigned inside it, then parse that text with a JSON parser.

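library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)

# Download the JS file and pull its contents out as a single string
r <- read_html('https://stats.nba.com/js/data/ptsd/stats_ptsd.js') %>%
  html_node('body') %>%
  html_text() %>%
  toString()
# Capture the text between 'stats_ptsd = ' and the closing ';'
data <- str_match_all(r, 'stats_ptsd = (.*);')
# Column 2 of the match matrix holds the capture group; parse it as JSON
data <- data.frame(jsonlite::fromJSON(data[[1]][, 2])$data$players)
write.csv(data, file = "players.csv")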

You could also subset and re-order before writing out:

# Keep only the name (X2) and id (X1) columns, with readable headers
df <- setNames(data[, c("X2", "X1")], c("Name", "Id"))
write.csv(df, file = "players.csv")
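
The extract-then-parse step can also be tried without a network call. The following is only a minimal sketch: it assumes the file body has the form `var stats_ptsd = {...};` and that each player is stored as an array whose first two entries are the id and the name. The helper `parse_ptsd` and the sample string are hypothetical and are not part of the code above.

```r
library(stringr)
library(jsonlite)

# Hypothetical helper: pull the object assigned to stats_ptsd out of the
# JS source text and return the players as a two-column data frame.
parse_ptsd <- function(js_text) {
  # Column 2 of str_match()'s result is the first capture group
  json_text <- str_match(js_text, "stats_ptsd = (.*);")[, 2]
  if (is.na(json_text)) stop("stats_ptsd assignment not found")
  players <- data.frame(fromJSON(json_text)$data$players)
  # Assumption: column 1 is the player id and column 2 the name
  setNames(players[, c("X2", "X1")], c("Name", "Id"))
}

# Made-up sample in the assumed layout; the real file may differ
sample_js <- 'var stats_ptsd = {"data": {"players": [[1, "Doe, John"], [2, "Roe, Jane"]]}};'
df <- parse_ptsd(sample_js)
```

If the assumptions hold for the real file, passing the downloaded string `r` to `parse_ptsd()` should give the same Name/Id table, and the `stop()` call gives an explicit error if the assignment is not found.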

References:

https://github.com/yusuzech/r-web-scraping-cheat-sheet/blob/master/README.md#rvest6.1
