Improving performance of bad / potentially unnecessary Apply in R

Question

Thanks in advance for the help with this. I'm not sure if I'm using apply wrong, or simply breaking other rules that are slowing down my code. Any help is appreciated.

Overview: I have basketball data where each row is a moment in a basketball game and includes the 10 players on the court, their teams, the game, and the minute of the game (1 - 40) that the row falls in. Using this data, I am computing, for each player and each of the 40 minutes, the percentage of their team's games in which they were on the court during that minute.

For example, if Joe's team played 20 games, and Joe appears in the data in the 5th minute of 13 of those games, then Joe was on court in the 5th minute of 65% of his team's games. I'm computing this for each player, for each season, for each of the 1-40 minutes, on a fairly large dataset, and I am running into performance issues. Here is the function I currently have for doing this:

library(dplyr)

# Raw Data Is Play-By-Play Data - Each Row contains stats for a pl (combination of 5 basketball players)
sheets_url <- 'https://docs.google.com/spreadsheets/d/1xmzaF6tpzVpjOmgfwHwFM_JE8LUszofjj25A5P0P21o/export?format=csv&id=1xmzaF6tpzVpjOmgfwHwFM_JE8LUszofjj25A5P0P21o&gid=630752085'
on.ct.data <- httr::content(httr::GET(url = sheets_url))

computeOnCourtByMinutePcts <- function(on.ct.data) {

  # Create Dataframe With Number Of Games Played By Team Each Season
  num.home.team.games <- on.ct.data %>%
    dplyr::group_by(homeTeamId, season) %>%
    dplyr::summarise(count = length(unique(gameId)))

  num.away.team.games <- on.ct.data %>%
    dplyr::group_by(awayTeamId, season) %>%
    dplyr::summarise(count = length(unique(gameId)))

  num.team.games <- num.home.team.games %>%
    dplyr::full_join(num.away.team.games, by = c('homeTeamId'='awayTeamId', 'season'='season')) %>%
    dplyr::mutate(gamesPlayed = rowSums(cbind(count.x, count.y), na.rm = TRUE)) %>%
    dplyr::rename(teamId = homeTeamId) %>%
    dplyr::mutate(season = as.character(season)) %>%
    dplyr::select(teamId, season, gamesPlayed)

  # Create Dataframe With Players By Season - Seems kind of bulky as well
  all.player.season.apperances <- rbind(
    on.ct.data %>% dplyr::select(homeTeamId, onCtHomeId1, season) %>% dplyr::rename(playerId = onCtHomeId1, teamId = homeTeamId),
    on.ct.data %>% dplyr::select(homeTeamId, onCtHomeId2, season) %>% dplyr::rename(playerId = onCtHomeId2, teamId = homeTeamId),
    on.ct.data %>% dplyr::select(homeTeamId, onCtHomeId3, season) %>% dplyr::rename(playerId = onCtHomeId3, teamId = homeTeamId),
    on.ct.data %>% dplyr::select(homeTeamId, onCtHomeId4, season) %>% dplyr::rename(playerId = onCtHomeId4, teamId = homeTeamId),
    on.ct.data %>% dplyr::select(homeTeamId, onCtHomeId5, season) %>% dplyr::rename(playerId = onCtHomeId5, teamId = homeTeamId),
    on.ct.data %>% dplyr::select(awayTeamId, onCtAwayId1, season) %>% dplyr::rename(playerId = onCtAwayId1, teamId = awayTeamId),
    on.ct.data %>% dplyr::select(awayTeamId, onCtAwayId2, season) %>% dplyr::rename(playerId = onCtAwayId2, teamId = awayTeamId),
    on.ct.data %>% dplyr::select(awayTeamId, onCtAwayId3, season) %>% dplyr::rename(playerId = onCtAwayId3, teamId = awayTeamId),
    on.ct.data %>% dplyr::select(awayTeamId, onCtAwayId4, season) %>% dplyr::rename(playerId = onCtAwayId4, teamId = awayTeamId),
    on.ct.data %>% dplyr::select(awayTeamId, onCtAwayId5, season) %>% dplyr::rename(playerId = onCtAwayId5, teamId = awayTeamId)) %>%
    dplyr::distinct(teamId, playerId, season) %>%
    dplyr::filter(!is.na(playerId))

  # For Each Player-Season, Compute Number Of Games On Court at each minute in game - this is the bad Apply
  playing.time.breakdowns <- apply(X = all.player.season.apperances, MARGIN = 1, FUN = function(thisRow) {

    # Set Player / Season Variables
    thisPlayerId = thisRow[2]
    thisSeason = thisRow[3]

    # Filter for each unique minute of each game with this player on court
    on.court.df = on.ct.data %>% 
      dplyr::filter(onCtHomeId1 == thisPlayerId | onCtHomeId2 == thisPlayerId | onCtHomeId3 == thisPlayerId | onCtHomeId4 == thisPlayerId | onCtHomeId5 == thisPlayerId |
                      onCtAwayId1 == thisPlayerId | onCtAwayId2 == thisPlayerId | onCtAwayId3 == thisPlayerId | onCtAwayId4 == thisPlayerId | onCtAwayId5 == thisPlayerId) %>%
      dplyr::filter(season == thisSeason) %>%
      dplyr::filter(!duplicated(paste0(gameId, minNumIntoGame)))

    # Turn This Into a table of minutes on court by game
    thisTable <- table(on.court.df$minNumIntoGame)

    this.player.distrubution.df <- data.frame(
      playerId = thisRow[2],
      teamId = thisRow[1],
      season = thisRow[3],
      minNumIntoGame = as.integer(names(thisTable)),
      numGamesAtMinNum = unname(thisTable) %>% as.vector(),
      stringsAsFactors = FALSE
    )

    # 40 minutes in basketball game, so previous dataframe needs 40 rows
    if(length(which(!(1:40 %in% this.player.distrubution.df$minNumIntoGame))) > 0) {
      zero.mins.played.df <- data.frame(
        playerId = thisRow[2],
        teamId = thisRow[1],
        season = thisRow[3],
        minNumIntoGame = which(!(1:40 %in% this.player.distrubution.df$minNumIntoGame)),
        numGamesAtMinNum = 0,
        stringsAsFactors = FALSE
      )

      this.player.distrubution.df <- plyr::rbind.fill(this.player.distrubution.df, zero.mins.played.df) %>% dplyr::arrange(minNumIntoGame)
    }

    # and return
    return(this.player.distrubution.df)
  })

  # Combine the output into one dataframe
  playing.time.breakdowns <- playing.time.breakdowns %>% do.call("rbind", .)

  # Join on Team-Games played
  playing.time.breakdowns <- playing.time.breakdowns %>%
    dplyr::left_join(num.team.games, by = c("teamId"="teamId", "season"="season")) %>%
    dplyr::rename(teamGamesPlayed = gamesPlayed)

  # Compute pct of games played
  playing.time.breakdowns <- playing.time.breakdowns %>%
    dplyr::mutate(pctMinNumPlayed = round(numGamesAtMinNum / teamGamesPlayed, 3))

  # Handle OT (minNumIntoGame > 40) needs a lower gamesPlayed denominator...

  # And Return
  return(playing.time.breakdowns);
}
on.ct.by.min <- computeOnCourtByMinutePcts(on.ct.data)

In summary, the code does the following:

1. Creates initial dataframes of all unique player-seasons and team-seasons. For team-seasons, it uses the play-by-play data to compute games played.
2. Apply - for each player-season: (a) find each instance of the player being on court (in one of the 10 onCt columns) for each minute of each game, (b) convert this into a table that shows the number of games the player was on court at each of the 1-40 minutes.
3. Polish up and return. Join a few tables together, and compute the relevant percentages.
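For comparison, the three steps above can probably be expressed as grouped operations instead of one filter per player-season. The following is only a minimal sketch on made-up toy data, not a tested replacement: it assumes tidyr is installed, the toy values and the names `toy`, `slot`, `long`, `team.games` and `counts` are hypothetical, and only two of the five on-court columns per side are included to keep it short. Unlike the original function, it counts per team as well as per player-season, which would matter for a player who appears for two teams in one season.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy stand-in for on.ct.data: two games, one season
toy <- data.frame(
  gameId         = c(1, 1, 1, 2, 2),
  season         = "2018",
  minNumIntoGame = c(1L, 1L, 2L, 1L, 2L),
  homeTeamId     = c(10, 10, 10, 20, 20),
  awayTeamId     = c(20, 20, 20, 10, 10),
  onCtHomeId1    = c(101, 101, 102, 201, 201),
  onCtHomeId2    = c(102, 102, 103, 202, 202),
  onCtAwayId1    = c(201, 201, 201, 101, 103),
  onCtAwayId2    = c(202, 202, 202, 102, 102),
  stringsAsFactors = FALSE
)

# Stack the home and away on-court columns into one row per player on court
home <- toy %>%
  select(gameId, season, minNumIntoGame, teamId = homeTeamId,
         starts_with("onCtHomeId")) %>%
  pivot_longer(starts_with("onCtHomeId"), names_to = "slot", values_to = "playerId")

away <- toy %>%
  select(gameId, season, minNumIntoGame, teamId = awayTeamId,
         starts_with("onCtAwayId")) %>%
  pivot_longer(starts_with("onCtAwayId"), names_to = "slot", values_to = "playerId")

# One row per team / season / player / game / minute
long <- bind_rows(home, away) %>%
  filter(!is.na(playerId)) %>%
  distinct(teamId, season, playerId, gameId, minNumIntoGame)

# Games played per team-season (home and away combined)
team.games <- long %>%
  distinct(teamId, season, gameId) %>%
  count(teamId, season, name = "teamGamesPlayed")

# Games on court per player-season-minute, padded with zeros for minutes 1-40
counts <- long %>%
  count(teamId, season, playerId, minNumIntoGame, name = "numGamesAtMinNum") %>%
  complete(nesting(teamId, season, playerId), minNumIntoGame = 1:40,
           fill = list(numGamesAtMinNum = 0L))

# Join on team games and compute the percentage
result <- counts %>%
  left_join(team.games, by = c("teamId", "season")) %>%
  mutate(pctMinNumPlayed = round(numGamesAtMinNum / teamGamesPlayed, 3))
```

The idea is that the data is scanned a fixed number of times rather than once per player-season. Whether that is fast enough on the full data would need to be measured.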

Note that it may be easier to follow the apply function by manually running it for one row of all.player.season.apperances (spelled that way in the code). Set thisRow to any row in the dataframe, and run the code line by line for a bit of clarity.
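One detail to keep in mind when stepping through by hand: apply() converts a data frame to a matrix before iterating, so with a character column present each row arrives as a character vector, and numeric ids are formatted as strings (which may add padding). A small sketch with a hypothetical two-row stand-in, `toy.appearances`, shows how to reproduce what the function actually sees, alongside an index-based alternative that keeps the column types:

```r
# Hypothetical stand-in for all.player.season.apperances
toy.appearances <- data.frame(
  teamId   = c(10, 200),
  playerId = c(7, 1234),
  season   = c("2018", "2018"),
  stringsAsFactors = FALSE
)

# What apply(X, MARGIN = 1, ...) passes to FUN: a row of as.matrix(X),
# which is all character here because of the season column
thisRow <- as.matrix(toy.appearances)[1, ]
str(thisRow)

# Indexing the columns directly keeps playerId numeric
i <- 1
thisPlayerId <- toy.appearances$playerId[i]
thisSeason   <- toy.appearances$season[i]
str(thisPlayerId)
```

If the ids in the real data differ in width, comparing a string taken from `thisRow` against a numeric column is worth double-checking.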

To highlight the slow-code issues, I have uploaded a large chunk of play-by-play / on-court data to Google Sheets, made it public, and included the link to load the data in the code above. The Google Sheet has ~1/2 of my current data, and the code currently takes ~8 minutes to run on my computer. My total data size is expected to increase by a factor of 10 in the near future. This is a script that needs to be run daily and fairly quickly, and I cannot afford for this one function to take 80 minutes.

It feels like my apply() call is not well done, as if it's no faster than an ordinary for loop. I'm not certain that apply is needed at all, and in fact, I don't think it is. But I have been struggling over the last 24 hours thinking about how to improve this function, with no luck. There must be a better approach here!

Edit: I have a minor bug in the reproducible example, which I am working on currently.

Edit2: Fixed the issue that was creating NAs in the num.team.games dataframe. I just ran the code and it appears to be working correctly. There are ~600 rows of output where the teamId is NA, which is nothing to worry about.

Edit3: It looks like each iteration of the apply takes about 0.06 seconds, and there are 5312 rows in the dataframe, which accounts for most of the ~8 minute run time. Should I be trying to reduce that 0.06 to <0.01, or ditch this whole approach? This is the main question that I'm not sure about...
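As a rough check of that arithmetic, using only the figures quoted above (the variable names are illustrative):

```r
secs.per.iteration <- 0.06   # per-iteration time quoted above
num.rows <- 5312             # rows in all.player.season.apperances

# Total time spent inside the apply, in seconds and minutes
total.secs <- secs.per.iteration * num.rows
total.mins <- total.secs / 60

# Average per-iteration time that would fully explain an 8 minute run
implied.secs.per.iteration <- (8 * 60) / num.rows

# Per-iteration budget if the whole loop had to finish in one minute
budget.secs.per.iteration <- 60 / num.rows
```

At 0.06 seconds the loop itself comes to roughly 5.3 minutes, so the remainder of the ~8 minutes presumably comes from slower iterations or from work outside the apply, such as the final rbind. Even at 0.01 seconds per iteration the loop would take just under a minute on the current data, before the expected 10x growth.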
