On pgatour.com/stats I am trying to scrape multiple stats over multiple tournaments over multiple years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. In the past, PGA's website looked like:
https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html
STAT_ID, YEAR_ID, and TOURNAMENT_ID would all change as you updated the particular stat, year, and tournament to correspond with their unique IDs. Because of this, I was able to use a function that sifted through all combinations of stat_id, year_id, and tournament_id to scrape the website. Now the website URLs don't change except for the particular stat_id being searched. If I change the tournament or year through the dropdowns, the stats load, but the URL remains unchanged, which prevents targeting different tournaments or years.
https://www.pgatour.com/stats/detail/02675 - 02675 being an example stat_id
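For reference, my old approach generated every combination of IDs along these lines (a sketch only; the year and tournament IDs below are placeholders, and these pages are no longer served):
library(tidyverse)
#placeholder IDs for illustration only
old_links <- expand.grid(
  stat_id = c("02568", "02675"),
  year_id = c("2021", "2022"),
  tournament_id = c("t007", "t033")
) %>%
  mutate(
    link = paste0(
      "https://www.pgatour.com/stats/stat.", stat_id,
      ".y.", year_id,
      ".eoff.", tournament_id, ".html"
    )
  )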
@Dave2e has been very helpful in showing me that the PGA site renders its stats with JavaScript and how to access some of the JSON data. I combined his teachings with my past code to scrape all stats for the most recent tournament. However, I can't figure out how to get the stats for past years or tournaments. In the JSON structure I see that there are IDs for $tournamentId and $year, but I'm uncertain how to use this info to search for past tournaments and years.
How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be trying to access this data with RSelenium as opposed to a package like rvest?
Code
library(tidyverse)
library(rvest)

#build one link per stat_id (tidyverse already attaches dplyr and tidyr)
df23 <- expand.grid(
  stat_id = c("02568", "02675", "101")
) %>%
  mutate(
    links = paste0("https://www.pgatour.com/stats/detail/", stat_id)
  ) %>%
  as_tibble()
#stat_id is not used in the body, but it keeps the signature compatible with map2() below
get_info <- function(link, stat_id) {
  #pull the embedded Next.js JSON payload out of the page
  data <- link %>%
    read_html() %>%
    html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text() %>%
    jsonlite::fromJSON()

  #NAs in playerName stop the data from being collected
  answer <- data$props$pageProps$statDetails$rows %>%
    drop_na(playerName)

  #flatten the list of stats data frames into a single data frame,
  #then merge it back with the original rows
  answer2 <- answer$stats %>%
    bind_rows(.id = "column_label") %>%
    select(-color) %>%
    pivot_wider(
      values_from = statValue,
      names_from = statName
    )

  #all stats combined and unnested
  bind_cols(answer, answer2)
}
test_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))

test_stats <- test_stats %>%
  unnest(everything())
Simplified code courtesy of @Dave2e
#read the page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")

#find the script with the correct id tag, strip the html code
datascript <- page %>%
  html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
  html_text()

#convert from JSON
output <- jsonlite::fromJSON(datascript)

#explore the output
str(output)

#get the main table
answer <- output$props$pageProps$statDetails$rows
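Exploring further, the IDs I mentioned appear to sit under statDetails; a minimal sketch, assuming the paths match what str(output) reports:
#assumed paths -- confirm against str(output) before relying on them
tournament_id <- output$props$pageProps$statDetails$tournamentId
year <- output$props$pageProps$statDetails$year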
Answer
If you take a look at the developer tools (F12 in your browser) and observe the Network tab when you click on a different year, you can see that a background request is made to retrieve that year's data. It returns a JSON dataset similar to the one in your original post.
To scrape this, you need to replicate this GraphQL POST request in your R program (see the sketch after the note on headers below). Note that it sends a JSON document with the query details, including the tournament code and the year.
Finally, to ensure that your GraphQL request succeeds, make sure that you match the headers you see in this inspector in your R program, in particular the Origin and Referer headers and the X-prefixed ones (you can probably hardcode these).
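For illustration, a minimal sketch of replicating the request with httr; the endpoint, operation name, variable names, and header values below are all assumptions, so copy the real ones from the request you see in the Network tab:
library(httr)
library(jsonlite)

#everything below is a placeholder -- copy the real endpoint, headers,
#and JSON body from the GraphQL request in the Network tab
endpoint <- "https://orchestrator.pgatour.com/graphql"  #assumed endpoint

payload <- list(
  operationName = "StatDetails",            #assumed operation name
  variables = list(                         #assumed variable names
    tourCode = "R",
    statId = "02675",
    year = 2022,
    eventQuery = list(tournamentId = "R2022033")
  ),
  query = "paste the full GraphQL query string from the Network tab here"
)

resp <- POST(
  endpoint,
  add_headers(
    Origin = "https://www.pgatour.com",
    Referer = "https://www.pgatour.com/",
    `x-api-key` = "paste the key from the Network tab"  #assumed X- header
  ),
  content_type_json(),
  body = toJSON(payload, auto_unbox = TRUE)
)

stats <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
If the response comes back empty or with an error status, compare your headers byte for byte with the ones the browser sends.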