Reputation: 10451
I work with sports data in R frequently, and run into the same problem with dplyr::group_by() when attempting to compute summary stats. I have the following dataframe with predicted points in each match of the world cup group stage:
dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal",
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany",
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland",
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France",
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic",
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia",
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland",
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan",
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia",
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark",
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic",
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco",
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica",
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama",
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco",
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany",
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765,
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741,
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816,
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264,
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665,
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932,
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787,
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251,
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294,
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608,
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778,
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503,
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA",
"epB"), class = "data.frame", row.names = c(NA, -48L))
head(worldcup.df)
teamA_name teamB_name epA epB
1 Russia Saudi Arabia 1.6400 1.0940
2 Egypt Uruguay 0.7051 2.0809
3 Morocco Iran 1.1294 1.6016
4 Portugal Spain 1.1116 1.6204
5 France Australia 2.1962 0.6098
6 Argentina Iceland 1.9840 0.7870
I've calculated epA and epB as the expected points for teams A and B in each game, and I'd now like to do a group_by() to compute each of the 32 teams total expected points. What I have historically done is something along these lines:
asAgroupby = worldcup.df %>%
dplyr::group_by(teamA_name) %>%
dplyr::summarise(totPts = sum(epA))
asBgroupby = worldcup.df %>%
dplyr::group_by(teamB_name) %>%
dplyr::summarise(totPts = sum(epB))
outputdf = asAgroupby %>%
dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
dplyr::mutate(totPts = totPts.x + totPts.y) %>%
dplyr::select(-one_of(c('totPts.x', 'totPts.y')))
2 separate group_by() calls, for each of the teamA and teamB column, followed by a left_join and then summing the columns and removing the excess column... yuck. This is as simple a case as this problem comes in as well: exactly 4 columns (2 identifying columns, and 2 stat columns). Since tons of sports data has columns for home / away teams, this is a common problem.
I feel like I need 1 dataframe with 2x the number of rows and 1/2 the number of columns so I can do one group by. Any help at all is appreciated, thanks!!!
Edit: worldcup.df is built from a long %>% of dplyr functions - bonus points if this can be done without creating new variables, but rather just:
worldcup.df %>%
...
Upvotes: 5
Views: 466
Reputation: 764
I've run into this issue as well with some fantasy football stuff. This is how I generally handle it:
df %>% select(team = teamA_name, ep = epA) %>%
bind_rows(df %>% select(team = teamB_name, ep = epB)) %>%
group_by(team) %>%
summarize(ep = sum(ep))
Upvotes: 2
Reputation: 4551
Here's a tidyverse
workflow that works by reformatting the data to a long format. It does keep track of who is in the same game (game_id
), and whether they were the A or B team - if that is useful. (in all fairness, this is the same basic idea as @Emil, just a different workflow to achieve it.)
worldcup.long <- worldcup.df %>%
as_data_frame() %>%
mutate(game_id = 1:n()) %>%
gather(key, value, - game_id) %>%
mutate(
AB = str_extract(key, "A|B"),
key = str_extract(key, "team|ep")
) %>%
spread(key, value,convert = TRUE)
outputdf <- worldcup.long %>%
group_by(team) %>%
summarize(totPts = sum(ep))
Upvotes: 3
Reputation: 16862
Your intuition is correct: you do want a data frame with fewer columns and more rows. dplyr::gather
will do that; in this case, you can do it in 2 piped gather
calls. The first gather
creates rows from the teamA_name
and teamB_name
columns. You can optionally extract the A or B from the entries in that column, to give you just "A" or "B" for each team and its scores. The second gather
does the same, but for the epA
and epB
columns. The key for this gather
gives you the same A or B information that you got from a_or_b
in the previous gather
, so I removed that additional column (select(-pts_a_or_b)
).
library(tidyverse)
df_long <- df %>%
as_tibble() %>%
gather(key = a_or_b, value = team, teamA_name, teamB_name) %>%
mutate(a_or_b = str_extract(a_or_b, "(?<=team)\\w")) %>%
gather(key = pts_a_or_b, value = points, epA, epB) %>%
select(-pts_a_or_b)
df_long
#> # A tibble: 192 x 3
#> a_or_b team points
#> <chr> <chr> <dbl>
#> 1 A Russia 1.64
#> 2 A Egypt 0.705
#> 3 A Morocco 1.13
#> 4 A Portugal 1.11
#> 5 A France 2.20
#> 6 A Argentina 1.98
#> 7 A Peru 1.58
#> 8 A Croatia 1.86
#> 9 A Costa Rica 1.28
#> 10 A Germany 2.09
#> # ... with 182 more rows
Feel free to correct me if there was more to the summary calculation than just adding up the total points for each team; if I'm understanding what you're looking for, you can get it this way:
df_long %>%
group_by(team) %>%
summarise(totPts = sum(points))
#> # A tibble: 32 x 2
#> team totPts
#> <chr> <dbl>
#> 1 Argentina 8.33
#> 2 Australia 8.32
#> 3 Belgium 8.33
#> 4 Brazil 8.51
#> 5 Colombia 8.31
#> 6 Costa Rica 8.33
#> 7 Croatia 8.23
#> 8 Denmark 8.22
#> 9 Egypt 8.24
#> 10 England 8.34
#> # ... with 22 more rows
Upvotes: 1
Reputation: 128
Here is a solution that is fewer lines and doesn't require a join:
df2 <- df[,c(2,1,4,3)]
names(df2) <- names(df)
rbind(df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82
Which is the same as the OP's:
outputdf
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82
Upvotes: 2