Canovice
Canovice

Reputation: 10163

Speed up triple-nested For Loop in R with vectorization

Overview: The following is an important chunk of R code for my basketball stats website. At a high level, the R code converts lineup statistics, where each row represents a unique lineup (a lineup is a combo of 5-players playing together), into on/off statistics, where each row represents a team's overall statistics with a specific player either (a) on-the-court, or (b) off-the-court.

I felt that a small snippet of data would not work for this reproducible example, and so I've uploaded the data to a Google Sheet, and made the sheet public. The reproducible code grabs this CSV data, but you can just as easily download the file by visiting the url.

With all of this said, here is the triple-nested for loop I am working with, which I've done my best to comment clearly:

# Raw Data Is Lineup Data - Each Row contains stats for a single lineup (combination of 5 basketball players)
sheets_url <- 'https://docs.google.com/spreadsheets/d/1GjDbWfZglwdwMwhNemWpX6uWjhmYfpQe-WNcCNE8EK4/export?format=csv&id=1GjDbWfZglwdwMwhNemWpX6uWjhmYfpQe-WNcCNE8EK4&gid=218640693'
raw.lineup.stats <- httr::content(httr::GET(url = sheets_url))

# Will contain the final output
on.off.stats <- c()

all_seasons <- c('1718', '1819')
# Loop each season
for(i in 1:length(all_seasons)) {
  # Filter Lineup Data to include only lineups / stats from this season
  this_season <- all_seasons[i]
  season.lineup.stats <- raw.lineup.stats %>% dplyr::filter(season == this_season)
  all_teams <- unique(season.lineup.stats$teamId)

  # Loop each team that appeared in data for this season
  for(j in 1:length(all_teams)) { 
    # Filter Lineup Data again to include only lineups / stats for this team
    print(paste0(j, ': ', all_teams[j]))
    this_team <- all_teams[j]
    team.season.lineup.stats <- season.lineup.stats %>% dplyr::filter(teamId == this_team)
    players_on_team <- unique(c(team.season.lineup.stats$onCtId1, team.season.lineup.stats$onCtId2, team.season.lineup.stats$onCtId3, team.season.lineup.stats$onCtId4, team.season.lineup.stats$onCtId5))

    # Loop each player on team j
    for(k in 1:length(players_on_team)) {
      # Identify if player is on-court or off-court - is his ID one of the 5
      this_player <- players_on_team[k]
      this.players.teams.lineup.stats <- team.season.lineup.stats %>%
        dplyr::mutate(isOnOrOff = ifelse(onCtId1 == this_player | onCtId2 == this_player | onCtId3 == this_player 
                                         | onCtId4 == this_player | onCtId5 == this_player, 'On Ct', 'Off Ct')) %>%
        dplyr::mutate(playerId = this_player) %>%
        dplyr::select(playerId, isOnOrOff, everything())

      # Convert this team' lineup data into 2 Rows: 1 for team's stats w/ player on-court, and 1 for team's stats w/ player off-court
      this.players.onoff.stats <- this.players.teams.lineup.stats %>%
        dplyr::group_by(playerId, isOnOrOff) %>%
        dplyr::mutate_at(vars(possessions:minutes), .funs = sum) %>%
        dplyr::mutate_at(vars(fieldGoalsMade:oppDefensiveReboundPct), .funs = sum) %>%
        dplyr::filter(!duplicated(isOnOrOff))

      # If player played every minute for his team, nrow(this.players.onoff.stats) == 1. If so, create needed blank off-row
      if(nrow(this.players.onoff.stats) == 1) {
        off.row <- this.players.onoff.stats %>%
          dplyr::ungroup() %>% dplyr::mutate(isOnOrOff = 'Off Ct') %>%
          dplyr::mutate_at(vars(possessions:oppPersonalFoulsPer40), .funs = function(x) return(0)) %>%

          dplyr::group_by(playerId, isOnOrOff)

        this.players.onoff.stats <- this.players.onoff.stats %>% rbind(off.row)
      }

      # And Rbind to the main container
      on.off.stats <- on.off.stats %>% base::rbind(this.players.onoff.stats)
    }
  }
}

Please let me know if there's anything not reproducible about the example. The data fetching, and for-loops, all work on my end. Code flow at a high level (this is all in the comments of the code) does this:

  1. Filter lineup data for single season
  2. Filter lineup data for single team
  3. For each player on team, add indicator column isOnOrOff that specifies if specified player is one of the 5 players in each lineup/row.
  4. Use isOnOrOff column with a group_by to convert the season's-team's lineup stats into on/off stats for the specific player.
  5. If player played every single minute for his team, add a blank 'off' row.
  6. rbind the player's on/off stats onto the output dataframe.

Following the comments when reviewing the code will hopefully make clear how the code is going about converting the data from lineup stats to on/off stats.

Current Speed / Future Data: As far as the current speed, this loop took 1.6 minutes the last time I ran it. With all of the stats (I removed ~300 columns in the example data), the loop takes 3.5 minutes. This is college basketball data, and currently I have only used ~40 teams when building my website. This is going to shortly change to ~350 teams, and with that change, each team will have an additional ~50% more lineups. In total, the size of data will increase by a factor of ~15x.

Given that I'm using a for-loop, I expect at least a 15x slow down, if not more (15x loops but each loop may be slower working with a bigger overall dataset), with the full dataset. I'm also required to call this loop twice each time the code is run, not once. In total, I'm estimating the future run-time at 3.5 * 15x more teams * 2 runs of code == ~105 minutes. This is too long. This code of mine will have to be run daily, and this triple for-loop is only a small part of a much larger script.

Close: any help on this is greatly appreciated. I'm aware that this isn't the easiest for-loop to vectorize, and I plan on bountying this post and any super helpful answers if need be.

EDIT: A quick shared thought on my approach. I felt that I had to use this nested for-loop approach, because the very important group_by has to be done on the team's lineup stats only. I don't care if a player is off-court if the line-up is for an entirely different team / for a season where the player didn't even play college basketball.

EDIT 2: If I could simply run the code inside of the j for-loop for the i seasons and j teams all simultaneously (for each i season, j team, identify players on that team, loop players on the team, compute each player's on/off stats, done), that would probably get the job done, right?

Upvotes: 1

Views: 156

Answers (1)

andrew_reece
andrew_reece

Reputation: 21264

You can get significant speedup by leveraging gather and group_by pivot/aggregate operations.

Starting at raw.lineup.stats, here's a pass which should get you most of the way there, at least in rough strokes. See below for notes.

library(tidyverse)

all_seasons <- c('1718', '1819')

# make a list of unique players per team, per season
players <- raw.lineup.stats %>%
  filter(season %in% all_seasons) %>% 
  gather(position, player, starts_with("onCtId")) %>%
  select(season, teamId, player) %>%
  group_by(season, teamId) %>%
  distinct(player, .keep_all = TRUE) %>%
  ungroup()

# cartesian join with the full df
# use lineupId to determine on/off court (on_ct)
# group_by and aggregate, then use distinct to drop duplicate rows
on_off <- inner_join(
    players, raw.lineup.stats, 
    by = c("season" = "season", "teamId" = "teamId")
  ) %>%
  mutate(on_ct = stringr::str_detect(lineupId, player)) %>% 
  group_by(season, teamId, player, on_ct) %>%
  mutate_at(vars(possessions:minutes, fieldGoalsMade:oppDefensiveReboundPct), 
            list(~sum)) %>%
  ungroup() %>%
  distinct(player, on_ct, .keep_all = TRUE) 

Here are some test comparisons from running your code vs the updated code:

# new code
> on_off[on_off$teamId == "WVU" & on_off$season == "1819", 
+        c("player", "on_ct", "possessions", "minutes")] %>% 
arrange(player) 
                 player on_ct possessions    minutes
1      AndrewGordon4009  TRUE        86.5  46.133333
2      AndrewGordon4009 FALSE       689.0 374.650000
3    BrandonKnappercbd1  TRUE       225.5 123.233333
4    BrandonKnappercbd1 FALSE       550.0 297.550000
5       ChaseHarler8a7e  TRUE       369.5 201.900000
6       ChaseHarler8a7e FALSE       406.0 218.883333
...

# old code
> on.off.stats[on.off.stats$teamId == "WVU" & on.off.stats$season == "1819", 
c("playerId", "isOnOrOff", "possessions", "minutes")] %>% 
arrange(playerId) 
               playerId isOnOrOff possessions    minutes
1      AndrewGordon4009     On Ct        86.5  46.133333
2      AndrewGordon4009    Off Ct       689.0 374.650000
3    BrandonKnappercbd1     On Ct       225.5 123.233333
4    BrandonKnappercbd1    Off Ct       550.0 297.550000
5       ChaseHarler8a7e     On Ct       369.5 201.900000
6       ChaseHarler8a7e    Off Ct       406.0 218.883333
...

Notes:

  • I'm still using magrittr pipes, because I think it's helpful for walking through a problem (and because I think a lot of tidyverse functions are really handy), but you can get some speedup if you want to convert to base R.
  • It appears you have a few bugs in your code which are unrelated to this speedup operation you're asking about - this made it much harder to verify against your output, as sometimes your output was wrong. For example, JamesBolden043b plays for team WVU in season 1718 only, according to raw.lineup.stats, but your on.off.stats final output has him playing in season 1819 as well. I'm also pretty sure your summarise vs mutate commands aren't giving you exactly what you want.
  • If you want player stats for on/off court, per each 5-person on-court configuration, there's an extra layer of grouping, with lineupId, that you'll need to do. (That made more sense to me when I was going through the data, but your call of course.)

I think what's left is syntax adjustments and bug hunting; the main intuition behind this code update should get you most of the way there. One other adjustment: you'll need to add in the rows which are missing in those cases where a player is on the court 100% of the time - but you don't need a for loop for that either.

Upvotes: 1

Related Questions