On pgatour.com/stats I am trying to scrape multiple stats over multiple tournaments over multiple years. Unfortunately, I am struggling to scrape data for past years or tournament IDs. In the past, PGA's website looked like:
https://www.pgatour.com/stats/stat.STAT_ID.y.YEAR_ID.eoff.TOURNAMENT_ID.html
STAT_ID, YEAR_ID, and TOURNAMENT_ID would all change as you updated the particular stat, year, and tournament to correspond with their unique IDs. Because of this, I was able to use a function that sifted through all combinations of stat_id, year_id, and tournament_id to scrape the website. Now the website URLs don't change except for the particular stat_id being searched. If I change the tournament or year through the dropdowns, the stats load, but the URL remains unchanged, which prevents targeting different tournaments or years.
https://www.pgatour.com/stats/detail/02675 - 02675 being an example stat_id
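For reference, my old approach generated every combination of IDs along these lines (a sketch only; the year and tournament IDs below are placeholders, and these pages are no longer served):
library(tidyverse)
#placeholder IDs for illustration only
old_links <- expand.grid(
  stat_id = c("02568", "02675"),
  year_id = c("2021", "2022"),
  tournament_id = c("t007", "t033")
) %>%
  mutate(
    link = paste0(
      "https://www.pgatour.com/stats/stat.", stat_id,
      ".y.", year_id,
      ".eoff.", tournament_id, ".html"
    )
  )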
@Dave2e has been very helpful in showing me that the PGA site renders its stats with JavaScript and how to access some of the JSON data. I combined his teachings with my past code to scrape all stats for the most recent tournament. However, I can't figure out how to get the stats for past years or tournaments. In the JSON structure I see that there are IDs for $tournamentId and $year, but I'm uncertain how to use this info to search for past tournaments and years.
How can I access the tournament and year IDs to scrape past data on pgatour.com? Should I be trying to access this data with RSelenium as opposed to a package like rvest?
Code
library(tidyverse)
library(rvest)

#build one link per stat_id (tidyverse already attaches dplyr and tidyr)
df23 <- expand.grid(
  stat_id = c("02568", "02675", "101")
) %>%
  mutate(
    links = paste0("https://www.pgatour.com/stats/detail/", stat_id)
  ) %>%
  as_tibble()
#stat_id is not used in the body, but it keeps the signature compatible with map2() below
get_info <- function(link, stat_id) {
  #pull the embedded Next.js JSON payload out of the page
  data <- link %>%
    read_html() %>%
    html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
    html_text() %>%
    jsonlite::fromJSON()

  #NAs in playerName stop the data from being collected
  answer <- data$props$pageProps$statDetails$rows %>%
    drop_na(playerName)

  #flatten the list of stats data frames into a single data frame,
  #then merge it back with the original rows
  answer2 <- answer$stats %>%
    bind_rows(.id = "column_label") %>%
    select(-color) %>%
    pivot_wider(
      values_from = statValue,
      names_from = statName
    )

  #all stats combined and unnested
  bind_cols(answer, answer2)
}
test_stats <- df23 %>%
  mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))

test_stats <- test_stats %>%
  unnest(everything())
Simplified code courtesy of @Dave2e
#read the page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")

#find the script with the correct id tag, strip the html code
datascript <- page %>%
  html_elements(xpath = ".//script[@id='__NEXT_DATA__']") %>%
  html_text()

#convert from JSON
output <- jsonlite::fromJSON(datascript)

#explore the output
str(output)

#get the main table
answer <- output$props$pageProps$statDetails$rows
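Exploring further, the IDs I mentioned appear to sit under statDetails; a minimal sketch, assuming the paths match what str(output) reports:
#assumed paths -- confirm against str(output) before relying on them
tournament_id <- output$props$pageProps$statDetails$tournamentId
year <- output$props$pageProps$statDetails$year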
Answer
If you take a look at the developer tools (F12 in your browser) and observe the Network tab when you click on a different year, you can see that a background request is made to retrieve that year's data. It returns a JSON dataset similar to the one in your original post.
To scrape this, you need to replicate this GraphQL POST request in your R program (see the sketch after the note on headers below). Note that it sends a JSON document with the query details, including the tournament code and the year.
Finally, to ensure that your GraphQL request succeeds, make sure that you match the headers you see in this inspector in your R program, in particular the Origin and Referer headers and the X-prefixed ones (you can probably hardcode these).
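For illustration, a minimal sketch of replicating the request with httr; the endpoint, operation name, variable names, and header values below are all assumptions, so copy the real ones from the request you see in the Network tab:
library(httr)
library(jsonlite)

#everything below is a placeholder -- copy the real endpoint, headers,
#and JSON body from the GraphQL request in the Network tab
endpoint <- "https://orchestrator.pgatour.com/graphql"  #assumed endpoint

payload <- list(
  operationName = "StatDetails",            #assumed operation name
  variables = list(                         #assumed variable names
    tourCode = "R",
    statId = "02675",
    year = 2022,
    eventQuery = list(tournamentId = "R2022033")
  ),
  query = "paste the full GraphQL query string from the Network tab here"
)

resp <- POST(
  endpoint,
  add_headers(
    Origin = "https://www.pgatour.com",
    Referer = "https://www.pgatour.com/",
    `x-api-key` = "paste the key from the Network tab"  #assumed X- header
  ),
  content_type_json(),
  body = toJSON(payload, auto_unbox = TRUE)
)

stats <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
If the response comes back empty or with an error status, compare your headers byte for byte with the ones the browser sends.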