Reputation: 10163
Overview: The following is an important chunk of R code for my basketball stats website. At a high level, the R code converts lineup statistics, where each row represents a unique lineup (a lineup is a combo of 5-players playing together), into on/off statistics, where each row represents a team's overall statistics with a specific player either (a) on-the-court, or (b) off-the-court.
I felt that a small snippet of data would not work for this reproducible example, and so I've uploaded the data to a Google Sheet, and made the sheet public. The reproducible code grabs this CSV data, but you can just as easily download the file by visiting the url.
With all of this said, here is the triple-nested for loop I am working with, which I've done my best to comment clearly:
# Raw Data Is Lineup Data - Each Row contains stats for a single lineup (combination of 5 basketball players)
sheets_url <- 'https://docs.google.com/spreadsheets/d/1GjDbWfZglwdwMwhNemWpX6uWjhmYfpQe-WNcCNE8EK4/export?format=csv&id=1GjDbWfZglwdwMwhNemWpX6uWjhmYfpQe-WNcCNE8EK4&gid=218640693'
raw.lineup.stats <- httr::content(httr::GET(url = sheets_url))
# Will contain the final output
on.off.stats <- c()
all_seasons <- c('1718', '1819')
# Loop each season
for(i in 1:length(all_seasons)) {
# Filter Lineup Data to include only lineups / stats from this season
this_season <- all_seasons[i]
season.lineup.stats <- raw.lineup.stats %>% dplyr::filter(season == this_season)
all_teams <- unique(season.lineup.stats$teamId)
# Loop each team that appeared in data for this season
for(j in 1:length(all_teams)) {
# Filter Lineup Data again to include only lineups / stats for this team
print(paste0(j, ': ', all_teams[j]))
this_team <- all_teams[j]
team.season.lineup.stats <- season.lineup.stats %>% dplyr::filter(teamId == this_team)
players_on_team <- unique(c(team.season.lineup.stats$onCtId1, team.season.lineup.stats$onCtId2, team.season.lineup.stats$onCtId3, team.season.lineup.stats$onCtId4, team.season.lineup.stats$onCtId5))
# Loop each player on team j
for(k in 1:length(players_on_team)) {
# Identify if player is on-court or off-court - is his ID one of the 5
this_player <- players_on_team[k]
this.players.teams.lineup.stats <- team.season.lineup.stats %>%
dplyr::mutate(isOnOrOff = ifelse(onCtId1 == this_player | onCtId2 == this_player | onCtId3 == this_player
| onCtId4 == this_player | onCtId5 == this_player, 'On Ct', 'Off Ct')) %>%
dplyr::mutate(playerId = this_player) %>%
dplyr::select(playerId, isOnOrOff, everything())
# Convert this team' lineup data into 2 Rows: 1 for team's stats w/ player on-court, and 1 for team's stats w/ player off-court
this.players.onoff.stats <- this.players.teams.lineup.stats %>%
dplyr::group_by(playerId, isOnOrOff) %>%
dplyr::mutate_at(vars(possessions:minutes), .funs = sum) %>%
dplyr::mutate_at(vars(fieldGoalsMade:oppDefensiveReboundPct), .funs = sum) %>%
dplyr::filter(!duplicated(isOnOrOff))
# If player played every minute for his team, nrow(this.players.onoff.stats) == 1. If so, create needed blank off-row
if(nrow(this.players.onoff.stats) == 1) {
off.row <- this.players.onoff.stats %>%
dplyr::ungroup() %>% dplyr::mutate(isOnOrOff = 'Off Ct') %>%
dplyr::mutate_at(vars(possessions:oppPersonalFoulsPer40), .funs = function(x) return(0)) %>%
dplyr::group_by(playerId, isOnOrOff)
this.players.onoff.stats <- this.players.onoff.stats %>% rbind(off.row)
}
# And Rbind to the main container
on.off.stats <- on.off.stats %>% base::rbind(this.players.onoff.stats)
}
}
}
Please let me know if there's anything not reproducible about the example. The data fetching, and for-loops, all work on my end. Code flow at a high level (this is all in the comments of the code) does this:
isOnOrOff
that specifies if specified player is one of the 5 players in each lineup/row.Following the comments when reviewing the code will hopefully make clear how the code is going about converting the data from lineup stats to on/off stats.
Current Speed / Future Data: As far as the current speed, this loop took 1.6 minutes the last time I ran it. With all of the stats (I removed ~300 columns in the example data), the loop takes 3.5 minutes. This is college basketball data, and currently I have only used ~40 teams when building my website. This is going to shortly change to ~350 teams, and with that change, each team will have an additional ~50% more lineups. In total, the size of data will increase by a factor of ~15x.
Given that I'm using a for-loop, I expect at least a 15x slow down, if not more (15x loops but each loop may be slower working with a bigger overall dataset), with the full dataset. I'm also required to call this loop twice each time the code is run, not once. In total, I'm estimating the future run-time at 3.5 * 15x more teams * 2 runs of code == ~105 minutes. This is too long. This code of mine will have to be run daily, and this triple for-loop is only a small part of a much larger script.
Close: any help on this is greatly appreciated. I'm aware that this isn't the easiest for-loop to vectorize, and I plan on bountying this post and any super helpful answers if need be.
EDIT: A quick shared thought on my approach. I felt that I had to use this nested for-loop approach, because the very important group_by has to be done on the team's lineup stats only. I don't care if a player is off-court if the line-up is for an entirely different team / for a season where the player didn't even play college basketball.
EDIT 2: If I could simply run the code inside of the j
for-loop for the i
seasons and j
teams all simultaneously (for each i
season, j
team, identify players on that team, loop players on the team, compute each player's on/off stats, done), that would probably get the job done, right?
Upvotes: 1
Views: 156
Reputation: 21264
You can get significant speedup by leveraging gather
and group_by
pivot/aggregate operations.
Starting at raw.lineup.stats
, here's a pass which should get you most of the way there, at least in rough strokes. See below for notes.
library(tidyverse)
all_seasons <- c('1718', '1819')
# make a list of unique players per team, per season
players <- raw.lineup.stats %>%
filter(season %in% all_seasons) %>%
gather(position, player, starts_with("onCtId")) %>%
select(season, teamId, player) %>%
group_by(season, teamId) %>%
distinct(player, .keep_all = TRUE) %>%
ungroup()
# cartesian join with the full df
# use lineupId to determine on/off court (on_ct)
# group_by and aggregate, then use distinct to drop duplicate rows
on_off <- inner_join(
players, raw.lineup.stats,
by = c("season" = "season", "teamId" = "teamId")
) %>%
mutate(on_ct = stringr::str_detect(lineupId, player)) %>%
group_by(season, teamId, player, on_ct) %>%
mutate_at(vars(possessions:minutes, fieldGoalsMade:oppDefensiveReboundPct),
list(~sum)) %>%
ungroup() %>%
distinct(player, on_ct, .keep_all = TRUE)
Here are some test comparisons from running your code vs the updated code:
# new code
> on_off[on_off$teamId == "WVU" & on_off$season == "1819",
+ c("player", "on_ct", "possessions", "minutes")] %>%
arrange(player)
player on_ct possessions minutes
1 AndrewGordon4009 TRUE 86.5 46.133333
2 AndrewGordon4009 FALSE 689.0 374.650000
3 BrandonKnappercbd1 TRUE 225.5 123.233333
4 BrandonKnappercbd1 FALSE 550.0 297.550000
5 ChaseHarler8a7e TRUE 369.5 201.900000
6 ChaseHarler8a7e FALSE 406.0 218.883333
...
# old code
> on.off.stats[on.off.stats$teamId == "WVU" & on.off.stats$season == "1819",
c("playerId", "isOnOrOff", "possessions", "minutes")] %>%
arrange(playerId)
playerId isOnOrOff possessions minutes
1 AndrewGordon4009 On Ct 86.5 46.133333
2 AndrewGordon4009 Off Ct 689.0 374.650000
3 BrandonKnappercbd1 On Ct 225.5 123.233333
4 BrandonKnappercbd1 Off Ct 550.0 297.550000
5 ChaseHarler8a7e On Ct 369.5 201.900000
6 ChaseHarler8a7e Off Ct 406.0 218.883333
...
Notes:
magrittr
pipes, because I think it's helpful for walking through a problem (and because I think a lot of tidyverse functions are really handy), but you can get some speedup if you want to convert to base R. JamesBolden043b
plays for team WVU
in season 1718
only, according to raw.lineup.stats
, but your on.off.stats
final output has him playing in season 1819
as well. I'm also pretty sure your summarise
vs mutate
commands aren't giving you exactly what you want. lineupId
, that you'll need to do. (That made more sense to me when I was going through the data, but your call of course.)I think what's left is syntax adjustments and bug hunting; the main intuition behind this code update should get you most of the way there. One other adjustment: you'll need to add in the rows which are missing in those cases where a player is on the court 100% of the time - but you don't need a for loop for that either.
Upvotes: 1