Reputation: 10451

Difficult aggregation in R, grouping team basketball stats

Thanks in advance for any help or suggestions on this. Here is a shortened example of the dataframe I am working with.

boxscore_stats = structure(list(game_id = c(157046L, 157046L, 157046L, 157046L, 
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 
159151L), team_id = c(116975, 116975, 116975, 116975, 116975, 
116975, 116975, 116975, 116975, 120310, 120310, 120310, 120310, 
120310, 120310, 120310, 120310, 120310, 121910, 121910, 121910, 
121910, 121910, 121910, 121910, 121910, 121910, 121910, 122072, 
122072, 122072, 122072, 122072, 122072, 122072, 122072, 122072, 
122072, 122072, 122072), minutes_played = c(18.76, 14.63, 8, 
16.69, 24.62, 32, 12.79, 5.28, 3.22, 24.35, 10.18, 20.65, 9.59, 
25.08, 14.12, 17.46, 23.15, 15.43, 22.84, 19.27, 21.31, 6.41, 
17.57, 17.4, 17.29, 7.22, 12.09, 17.25, 2.28, 16.87, 6.6, 19.73, 
6.31, 13.25, 26.25, 6.08, 28.71, 11.2, 17.54, 5.17), fieldGoalsMade = c(1L, 
1L, 4L, 1L, 2L, 7L, 1L, 1L, 1L, 4L, 0L, 3L, 1L, 3L, 0L, 6L, 7L, 
1L, 7L, 4L, 5L, 1L, 2L, 6L, 2L, 0L, 1L, 3L, 0L, 1L, 1L, 3L, 0L, 
1L, 11L, 2L, 5L, 1L, 2L, 1L), fieldGoalAttempts = c(8L, 6L, 7L, 
2L, 9L, 16L, 3L, 1L, 2L, 12L, 4L, 12L, 3L, 11L, 4L, 9L, 13L, 
6L, 12L, 10L, 14L, 2L, 6L, 11L, 6L, 2L, 2L, 6L, 0L, 5L, 3L, 10L, 
2L, 3L, 21L, 3L, 17L, 4L, 9L, 2L)), .Names = c("game_id", "team_id", 
"minutes_played", "fieldGoalsMade", "fieldGoalAttempts"), row.names = c(NA, 
40L), class = "data.frame")


head(boxscore_stats)

   game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1   157046  116975          18.76              1                 8
2   157046  116975          14.63              1                 6
3   157046  116975           8.00              4                 7
4   157046  116975          16.69              1                 2
5   157046  116975          24.62              2                 9
6   157046  116975          32.00              7                16
7   157046  116975          12.79              1                 3
8   157046  116975           5.28              1                 1
9   157046  116975           3.22              1                 2
10  157046  120310          24.35              4                12
11  157046  120310          10.18              0                 4
12  157046  120310          20.65              3                12
13  157046  120310           9.59              1                 3
14  157046  120310          25.08              3                11
15  157046  120310          14.12              0                 4
16  157046  120310          17.46              6                 9
17  157046  120310          23.15              7                13
18  157046  120310          15.43              1                 6
19  159151  121910          22.84              7                12
20  159151  121910          19.27              4                10
21  159151  121910          21.31              5                14
22  159151  121910           6.41              1                 2
23  159151  121910          17.57              2                 6
24  159151  121910          17.40              6                11
25  159151  121910          17.29              2                 6
26  159151  121910           7.22              0                 2
27  159151  121910          12.09              1                 2
28  159151  121910          17.25              3                 6
29  159151  122072           2.28              0                 0
30  159151  122072          16.87              1                 5
31  159151  122072           6.60              1                 3
32  159151  122072          19.73              3                10
33  159151  122072           6.31              0                 2
34  159151  122072          13.25              1                 3
35  159151  122072          26.25             11                21
36  159151  122072           6.08              2                 3
37  159151  122072          28.71              5                17
38  159151  122072          11.20              1                 4
39  159151  122072          17.54              2                 9
40  159151  122072           5.17              1                 2

The important things to note about this dataframe is that each game_id corresponds with two team_ids, for the two teams that played in the game. Each game_id is unique to one game of basketball. Each row corresponds with the stats for a player on the team_ids team in that game. The example above has only two games / 4 teams / 40 players, but my full dataframe has hundreds of games, which each team_id showing up many times.

My first aggregation, which I was able to do, was to aggregate everything by team_id. This code got the job done for me for the first aggregation:

boxscore_stats_aggregated = aggregate(boxscore_stats, by = list(boxscore_stats[, 2]), FUN = sum)

which was fairly straightforward. For any team_id, I had aggregated all of their minutes played, all of their fieldGoalsMade, etc. For my next aggregation though, I need to aggregate by team_id again but instead of aggregating a team by their own rows / stats, instead I need to aggregate the rows / stats of their opponents. This answers the question "For any team, how many fieldsGoalsMade did they allow in total to opponents, etc." So in this case, for team_id = 116975, I would want to aggregate all the rows with team_id 120310. Of course next time team_id 116975 appears in my dataframe in a new game, it is likely that they are playing a different opponent, so this aggregation is not as simple as aggregating by team_id 120310.

I think I should be able to use the relationship between the two team_ids being unique to the unique game_ids to make this aggregation possible, but I am struggling with how it could be implemented.

Thanks!

Upvotes: 2

Answers (3)

thus__

Reputation: 486

If you want to isolate single team_ids, I would use the dplyr package.

For example if you wanted to know % of field goals per team I would write something like:

boxscore_stats %>% 
     group_by(team_id) %>% 
     summarize(perc_fg = sum(fieldGoalsMade)/sum(fieldGoalAttempts))

This would give you a new data.frame aggregated by team ID.

Upvotes: 0

majom

Reputation: 8041

Here is approach using data.table:

(1) Read in the data:

# Load package
library(data.table)

# Load your data
boxscore_stats <- fread("row game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1   157046  116975          18.76              1                 8
           2   157046  116975          14.63              1                 6
           3   157046  116975           8.00              4                 7
           4   157046  116975          16.69              1                 2
           5   157046  116975          24.62              2                 9
           6   157046  116975          32.00              7                16
           7   157046  116975          12.79              1                 3
           8   157046  116975           5.28              1                 1
           9   157046  116975           3.22              1                 2
           10  157046  120310          24.35              4                12
           11  157046  120310          10.18              0                 4
           12  157046  120310          20.65              3                12
           13  157046  120310           9.59              1                 3
           14  157046  120310          25.08              3                11
           15  157046  120310          14.12              0                 4
           16  157046  120310          17.46              6                 9
           17  157046  120310          23.15              7                13
           18  157046  120310          15.43              1                 6
           19  159151  121910          22.84              7                12
           20  159151  121910          19.27              4                10
           21  159151  121910          21.31              5                14
           22  159151  121910           6.41              1                 2
           23  159151  121910          17.57              2                 6
           24  159151  121910          17.40              6                11
           25  159151  121910          17.29              2                 6
           26  159151  121910           7.22              0                 2
           27  159151  121910          12.09              1                 2
           28  159151  121910          17.25              3                 6
           29  159151  122072           2.28              0                 0
           30  159151  122072          16.87              1                 5
           31  159151  122072           6.60              1                 3
           32  159151  122072          19.73              3                10
           33  159151  122072           6.31              0                 2
           34  159151  122072          13.25              1                 3
           35  159151  122072          26.25             11                21
           36  159151  122072           6.08              2                 3
           37  159151  122072          28.71              5                17
           38  159151  122072          11.20              1                 4
           39  159151  122072          17.54              2                 9
           40  159151  122072           5.17              1                 2
           ")

(2) Proceed with the actual calculations:

# Aggregate on team-and game level (data.table style)
boxscore_stats_aggregated  <-  boxscore_stats[, lapply(.SD, sum), by = list(game_id, team_id)] 

# Match EVERY team to each opponent, i.e. still two rows per game
# but columns for opponent's performance added.
# Some teams drops out in the dummy data as they opponent data was missing.
merge(boxscore_stats_aggregated, boxscore_stats_aggregated, 
      by="game_id", suffixes = c("", ".opponent"))[team_id!=team_id.opponent,]

output looks like that:

# > output
#    game_id team_id row minutes_played fieldGoalsMade fieldGoalAttempts team_id.opponent row.opponent minutes_played.opponent fieldGoalsMade.opponent fieldGoalAttempts.opponent
# 1: 1413414  116975  45         135.99             19                54           120310          126                  160.01                      25                         74
# 2: 1413414  120310 126         160.01             25                74           116975           45                  135.99                      19                         54

Upvotes: 2

Parfait

Reputation: 107737

And just in case, for OP to consider or future readers below is a base R version with merge() for side by side aggregates of team and opposition by game_id. A staging temp col, gamecount, is needed.

# TEAM AGGREGATION
aggdf <- aggregate(.~game_id + team_id, boxscore_stats, FUN = sum)

# GAME COUNT BY TEAM (TEMP COL USED FOR MERGE/FILTER)
aggdf$gamecount <- sapply(1:nrow(aggdf), function(i) 
                          sum(aggdf[1:i, c("game_id")] == aggdf$game_id[i]))    
# MERGE AND FILTER
mdf <- merge(aggdf, aggdf, by="game_id")
mdf <- mdf[mdf$team_id.x != mdf$team_id.y & mdf$gamecount.x == 1,]
mdf$gamecount.x <- mdf$gamecount.y <- NULL

# RENAME COL AND ROW NAMES
names(mdf)[grepl("\\.x", names(mdf))] <- gsub("\\.x", "", 
                                              names(mdf)[grepl("\\.x", names(mdf))])
names(mdf)[grepl("\\.y", names(mdf))] <- gsub("\\.y", ".opp", 
                                              names(mdf)[grepl("\\.y", names(mdf))])
rownames(mdf) <- 1:nrow(mdf)

#   game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts team_id.opp
# 1  157046  116975         135.99             19                54      120310
# 2  159151  121910         158.65             31                71      122072
#   minutes_played.opp fieldGoalsMade.opp fieldGoalAttempts.opp
# 1             160.01                 25                    74
# 2             159.99                 28                    79

Upvotes: 1

Difficult aggregation in R, grouping team basketball stats

Answers (3)

Related Questions