Reputation: 10451
Thanks in advance for any help or suggestions on this. Here is a shortened example of the dataframe I am working with.
boxscore_stats = structure(list(game_id = c(157046L, 157046L, 157046L, 157046L,
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L,
157046L, 157046L, 157046L, 157046L, 157046L, 157046L, 157046L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L, 159151L, 159151L, 159151L, 159151L, 159151L, 159151L,
159151L), team_id = c(116975, 116975, 116975, 116975, 116975,
116975, 116975, 116975, 116975, 120310, 120310, 120310, 120310,
120310, 120310, 120310, 120310, 120310, 121910, 121910, 121910,
121910, 121910, 121910, 121910, 121910, 121910, 121910, 122072,
122072, 122072, 122072, 122072, 122072, 122072, 122072, 122072,
122072, 122072, 122072), minutes_played = c(18.76, 14.63, 8,
16.69, 24.62, 32, 12.79, 5.28, 3.22, 24.35, 10.18, 20.65, 9.59,
25.08, 14.12, 17.46, 23.15, 15.43, 22.84, 19.27, 21.31, 6.41,
17.57, 17.4, 17.29, 7.22, 12.09, 17.25, 2.28, 16.87, 6.6, 19.73,
6.31, 13.25, 26.25, 6.08, 28.71, 11.2, 17.54, 5.17), fieldGoalsMade = c(1L,
1L, 4L, 1L, 2L, 7L, 1L, 1L, 1L, 4L, 0L, 3L, 1L, 3L, 0L, 6L, 7L,
1L, 7L, 4L, 5L, 1L, 2L, 6L, 2L, 0L, 1L, 3L, 0L, 1L, 1L, 3L, 0L,
1L, 11L, 2L, 5L, 1L, 2L, 1L), fieldGoalAttempts = c(8L, 6L, 7L,
2L, 9L, 16L, 3L, 1L, 2L, 12L, 4L, 12L, 3L, 11L, 4L, 9L, 13L,
6L, 12L, 10L, 14L, 2L, 6L, 11L, 6L, 2L, 2L, 6L, 0L, 5L, 3L, 10L,
2L, 3L, 21L, 3L, 17L, 4L, 9L, 2L)), .Names = c("game_id", "team_id",
"minutes_played", "fieldGoalsMade", "fieldGoalAttempts"), row.names = c(NA,
40L), class = "data.frame")
head(boxscore_stats)
game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1 157046 116975 18.76 1 8
2 157046 116975 14.63 1 6
3 157046 116975 8.00 4 7
4 157046 116975 16.69 1 2
5 157046 116975 24.62 2 9
6 157046 116975 32.00 7 16
7 157046 116975 12.79 1 3
8 157046 116975 5.28 1 1
9 157046 116975 3.22 1 2
10 157046 120310 24.35 4 12
11 157046 120310 10.18 0 4
12 157046 120310 20.65 3 12
13 157046 120310 9.59 1 3
14 157046 120310 25.08 3 11
15 157046 120310 14.12 0 4
16 157046 120310 17.46 6 9
17 157046 120310 23.15 7 13
18 157046 120310 15.43 1 6
19 159151 121910 22.84 7 12
20 159151 121910 19.27 4 10
21 159151 121910 21.31 5 14
22 159151 121910 6.41 1 2
23 159151 121910 17.57 2 6
24 159151 121910 17.40 6 11
25 159151 121910 17.29 2 6
26 159151 121910 7.22 0 2
27 159151 121910 12.09 1 2
28 159151 121910 17.25 3 6
29 159151 122072 2.28 0 0
30 159151 122072 16.87 1 5
31 159151 122072 6.60 1 3
32 159151 122072 19.73 3 10
33 159151 122072 6.31 0 2
34 159151 122072 13.25 1 3
35 159151 122072 26.25 11 21
36 159151 122072 6.08 2 3
37 159151 122072 28.71 5 17
38 159151 122072 11.20 1 4
39 159151 122072 17.54 2 9
40 159151 122072 5.17 1 2
The important things to note about this dataframe is that each game_id corresponds with two team_ids, for the two teams that played in the game. Each game_id is unique to one game of basketball. Each row corresponds with the stats for a player on the team_ids team in that game. The example above has only two games / 4 teams / 40 players, but my full dataframe has hundreds of games, which each team_id showing up many times.
My first aggregation, which I was able to do, was to aggregate everything by team_id. This code got the job done for me for the first aggregation:
boxscore_stats_aggregated = aggregate(boxscore_stats, by = list(boxscore_stats[, 2]), FUN = sum)
which was fairly straightforward. For any team_id, I had aggregated all of their minutes played, all of their fieldGoalsMade, etc. For my next aggregation though, I need to aggregate by team_id again but instead of aggregating a team by their own rows / stats, instead I need to aggregate the rows / stats of their opponents. This answers the question "For any team, how many fieldsGoalsMade did they allow in total to opponents, etc." So in this case, for team_id = 116975, I would want to aggregate all the rows with team_id 120310. Of course next time team_id 116975 appears in my dataframe in a new game, it is likely that they are playing a different opponent, so this aggregation is not as simple as aggregating by team_id 120310.
I think I should be able to use the relationship between the two team_ids being unique to the unique game_ids to make this aggregation possible, but I am struggling with how it could be implemented.
Thanks!
Upvotes: 2
Views: 133
Reputation: 486
If you want to isolate single team_ids, I would use the dplyr package.
For example if you wanted to know % of field goals per team I would write something like:
boxscore_stats %>%
group_by(team_id) %>%
summarize(perc_fg = sum(fieldGoalsMade)/sum(fieldGoalAttempts))
This would give you a new data.frame
aggregated by team ID
.
Upvotes: 0
Reputation: 8041
Here is approach using data.table:
(1) Read in the data:
# Load package
library(data.table)
# Load your data
boxscore_stats <- fread("row game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts
1 157046 116975 18.76 1 8
2 157046 116975 14.63 1 6
3 157046 116975 8.00 4 7
4 157046 116975 16.69 1 2
5 157046 116975 24.62 2 9
6 157046 116975 32.00 7 16
7 157046 116975 12.79 1 3
8 157046 116975 5.28 1 1
9 157046 116975 3.22 1 2
10 157046 120310 24.35 4 12
11 157046 120310 10.18 0 4
12 157046 120310 20.65 3 12
13 157046 120310 9.59 1 3
14 157046 120310 25.08 3 11
15 157046 120310 14.12 0 4
16 157046 120310 17.46 6 9
17 157046 120310 23.15 7 13
18 157046 120310 15.43 1 6
19 159151 121910 22.84 7 12
20 159151 121910 19.27 4 10
21 159151 121910 21.31 5 14
22 159151 121910 6.41 1 2
23 159151 121910 17.57 2 6
24 159151 121910 17.40 6 11
25 159151 121910 17.29 2 6
26 159151 121910 7.22 0 2
27 159151 121910 12.09 1 2
28 159151 121910 17.25 3 6
29 159151 122072 2.28 0 0
30 159151 122072 16.87 1 5
31 159151 122072 6.60 1 3
32 159151 122072 19.73 3 10
33 159151 122072 6.31 0 2
34 159151 122072 13.25 1 3
35 159151 122072 26.25 11 21
36 159151 122072 6.08 2 3
37 159151 122072 28.71 5 17
38 159151 122072 11.20 1 4
39 159151 122072 17.54 2 9
40 159151 122072 5.17 1 2
")
(2) Proceed with the actual calculations:
# Aggregate on team-and game level (data.table style)
boxscore_stats_aggregated <- boxscore_stats[, lapply(.SD, sum), by = list(game_id, team_id)]
# Match EVERY team to each opponent, i.e. still two rows per game
# but columns for opponent's performance added.
# Some teams drops out in the dummy data as they opponent data was missing.
merge(boxscore_stats_aggregated, boxscore_stats_aggregated,
by="game_id", suffixes = c("", ".opponent"))[team_id!=team_id.opponent,]
output
looks like that:
# > output
# game_id team_id row minutes_played fieldGoalsMade fieldGoalAttempts team_id.opponent row.opponent minutes_played.opponent fieldGoalsMade.opponent fieldGoalAttempts.opponent
# 1: 1413414 116975 45 135.99 19 54 120310 126 160.01 25 74
# 2: 1413414 120310 126 160.01 25 74 116975 45 135.99 19 54
Upvotes: 2
Reputation: 107737
And just in case, for OP to consider or future readers below is a base R version with merge()
for side by side aggregates of team and opposition by game_id
. A staging temp col, gamecount
, is needed.
# TEAM AGGREGATION
aggdf <- aggregate(.~game_id + team_id, boxscore_stats, FUN = sum)
# GAME COUNT BY TEAM (TEMP COL USED FOR MERGE/FILTER)
aggdf$gamecount <- sapply(1:nrow(aggdf), function(i)
sum(aggdf[1:i, c("game_id")] == aggdf$game_id[i]))
# MERGE AND FILTER
mdf <- merge(aggdf, aggdf, by="game_id")
mdf <- mdf[mdf$team_id.x != mdf$team_id.y & mdf$gamecount.x == 1,]
mdf$gamecount.x <- mdf$gamecount.y <- NULL
# RENAME COL AND ROW NAMES
names(mdf)[grepl("\\.x", names(mdf))] <- gsub("\\.x", "",
names(mdf)[grepl("\\.x", names(mdf))])
names(mdf)[grepl("\\.y", names(mdf))] <- gsub("\\.y", ".opp",
names(mdf)[grepl("\\.y", names(mdf))])
rownames(mdf) <- 1:nrow(mdf)
# game_id team_id minutes_played fieldGoalsMade fieldGoalAttempts team_id.opp
# 1 157046 116975 135.99 19 54 120310
# 2 159151 121910 158.65 31 71 122072
# minutes_played.opp fieldGoalsMade.opp fieldGoalAttempts.opp
# 1 160.01 25 74
# 2 159.99 28 79
Upvotes: 1