DonnyDolio

Reputation: 99

Scraping Dynamic JSON Data in R

On pgatour.com/stats I am trying to scrape multiple stats across multiple tournaments and years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. In the past, the PGA Tour website's URLs looked like:

https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html

STAT_ID, YEAR_ID, and TOURNAMENT_ID changed to match the unique IDs of the selected stat, year, and tournament. Because of this, I was able to use a function that looped through all combinations of stat_id, year_id, and tournament_id to scrape the website. Now the URL changes only with the stat_id being viewed. If I change the tournament or year through the dropdowns, the stats load, but the URL remains unchanged, so I can no longer target different tournaments or years.

https://www.pgatour.com/stats/detail/02675 - 02675 being an example stat_id

@Dave2e has been very helpful in showing me that the PGA Tour site uses JavaScript and how to access some of the JSON data. I combined what he showed me with my earlier code to scrape all stats for the most recent tournament. However, I can't figure out how to get the stats for past years or tournaments. In the str() output of the JSON I see that there are IDs for $tournamentId and $year, but I'm unsure how to use them to request past tournaments and years.

How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be using RSelenium rather than a package like rvest?


Code

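library(tidyverse)  # attaches dplyr, tidyr, purrr and tibble
library(rvest)

# One row per stat, with the URL of its detail page
df23 <- expand.grid(
  stat_id = c("02568", "02675", "101"),
  stringsAsFactors = FALSE
) %>%
  mutate(links = paste0("https://www.pgatour.com/stats/detail/", stat_id)) %>%
  as_tibble()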

get_info <- function(link, stat_id) {
  # stat_id is not used in the body; it is accepted so map2() can pass it
  data <- link %>%
    read_html() %>%
    html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text() %>%
    jsonlite::fromJSON()

  # NAs in playerName stop the data from being collected, so drop those rows
  answer <- data$props$pageProps$statDetails$rows %>%
    drop_na(playerName)

  # answer$stats is a list of data frames: stack them into a single data
  # frame, then spread each stat into its own column
  answer2 <- bind_rows(answer$stats, .id = "column_label") %>%
    select(-color) %>%
    pivot_wider(values_from = statValue, names_from = statName)

  # All stats combined and unnested, returned explicitly
  bind_cols(answer, answer2)
}

# Scrape every link; a failed page yields an empty tibble instead of an error
test_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble()))) %>%
  unnest(everything())

Simplified code courtesy of @Dave2e

#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")

#find the script with the correct id tag, strip the html code
datascript <- page %>%
  html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
  html_text()

#convert from JSON 
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)

#get the main table
answer <- output$props$pageProps$statDetails$rows
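
One way to see where the $tournamentId and $year fields sit in the parsed list is a small recursive search. This is only a sketch: find_fields is a hypothetical helper (not part of rvest or jsonlite), and the demo list is made up to mimic the nesting of the real output.

```r
# Recursively collect every value stored under the given field names in a
# nested list, such as the one returned by jsonlite::fromJSON().
# The result is a list named by the path at which each value was found.
find_fields <- function(x, fields, path = "") {
  if (!is.list(x)) return(list())
  hits <- list()
  nms <- names(x)
  for (i in seq_along(x)) {
    # Fall back to the position when an element has no name
    nm <- if (is.null(nms) || is.na(nms[i]) || nms[i] == "") as.character(i) else nms[i]
    here <- paste0(path, "$", nm)
    if (nm %in% fields) hits[[here]] <- x[[i]]
    # Descend into nested lists and data frames
    hits <- c(hits, find_fields(x[[i]], fields, here))
  }
  hits
}

# Made-up structure for illustration only; the real field locations may differ
demo <- list(props = list(pageProps = list(
  statDetails = list(tournamentId = "T1", year = 2023, rows = list())
)))
found <- find_fields(demo, c("tournamentId", "year"))
print(names(found))
```

With the real page, find_fields(output, c("tournamentId", "year")) would list every path at which those names occur.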

Upvotes: 1

Views: 281

Answers (1)

Granitosaurus

Reputation: 21436

If you open your browser's developer tools (F12 key) and watch the Network tab while you select a different year, you can see that a background request is made to retrieve that year's data.

It returns a JSON dataset similar to the one in your original post.

To scrape this, replicate that GraphQL POST request in your R program. Note that the request body is a JSON document containing the query details, including the tournament code and the year.

Finally, to make sure the GraphQL request succeeds, have your R program send the same headers you see in the inspector, in particular Origin, Referer and the X- prefixed ones (you can probably hardcode these).
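
A minimal sketch of such a request with httr, under stated assumptions: graphql_url, api_key and stat_query are placeholders to be filled in from the request shown in the Network tab, the x-api-key header name stands in for whichever X- prefixed headers the inspector shows, and the variable names statId, year and tournamentId are guesses that must be renamed to match the payload you observe. It needs the httr and jsonlite packages installed.

```r
# Placeholders (assumptions): copy the real values from the request shown
# in the browser's Network tab.
graphql_url <- "PASTE_REQUEST_URL_HERE"
api_key     <- "PASTE_X_HEADER_VALUE_HERE"
stat_query  <- "PASTE_QUERY_STRING_FROM_REQUEST_PAYLOAD_HERE"

# Build the request body as an R list. The variable names are hypothetical
# and must match the "variables" object in the observed payload.
build_payload <- function(stat_id, year, tournament_id = NULL, query = stat_query) {
  variables <- list(statId = stat_id, year = year)
  if (!is.null(tournament_id)) variables$tournamentId <- tournament_id
  list(query = query, variables = variables)
}

# Send the POST request with the headers mentioned above and parse the reply.
fetch_stat <- function(stat_id, year, tournament_id = NULL) {
  resp <- httr::POST(
    graphql_url,
    httr::add_headers(
      Origin      = "https://www.pgatour.com",
      Referer     = "https://www.pgatour.com/",
      `x-api-key` = api_key  # hypothetical header name
    ),
    body   = build_payload(stat_id, year, tournament_id),
    encode = "json"  # httr serialises the list to a JSON document
  )
  httr::stop_for_status(resp)
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}
```

Once the placeholders are filled in, a call such as fetch_stat("02675", 2022) should return a nested list that you can explore with str(), and the call can be wrapped in purrr::map() to loop over years or tournament codes.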

Upvotes: 2
