Reputation: 10451

In R, group by sports data that has Away and Home teams - a common frustration

I work with sports data in R frequently, and run into the same problem with dplyr::group_by() when attempting to compute summary stats. I have the following dataframe with predicted points in each match of the world cup group stage:

dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal", 
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany", 
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland", 
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France", 
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic", 
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia", 
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland", 
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan", 
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia", 
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark", 
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic", 
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco", 
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica", 
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama", 
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco", 
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany", 
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765, 
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741, 
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816, 
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264, 
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665, 
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932, 
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787, 
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251, 
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294, 
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608, 
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778, 
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503, 
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA", 
"epB"), class = "data.frame", row.names = c(NA, -48L))

head(worldcup.df)
  teamA_name   teamB_name    epA    epB
1     Russia Saudi Arabia 1.6400 1.0940
2      Egypt      Uruguay 0.7051 2.0809
3    Morocco         Iran 1.1294 1.6016
4   Portugal        Spain 1.1116 1.6204
5     France    Australia 2.1962 0.6098
6  Argentina      Iceland 1.9840 0.7870

I've calculated epA and epB as the expected points for teams A and B in each game, and I'd now like to do a group_by() to compute each of the 32 teams total expected points. What I have historically done is something along these lines:

asAgroupby = worldcup.df %>% 
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))

asBgroupby = worldcup.df %>% 
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))

outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))

2 separate group_by() calls, for each of the teamA and teamB column, followed by a left_join and then summing the columns and removing the excess column... yuck. This is as simple a case as this problem comes in as well: exactly 4 columns (2 identifying columns, and 2 stat columns). Since tons of sports data has columns for home / away teams, this is a common problem.

I feel like I need 1 dataframe with 2x the number of rows and 1/2 the number of columns so I can do one group by. Any help at all is appreciated, thanks!!!

Edit: worldcup.df is built from a long %>% of dplyr functions - bonus points if this can be done without creating new variables, but rather just:

worldcup.df %>%
  ...

Upvotes: 5

Answers (4)

TBT8

Reputation: 764

I've run into this issue as well with some fantasy football stuff. This is how I generally handle it:

df %>% select(team = teamA_name, ep = epA) %>% 
     bind_rows(df %>% select(team = teamB_name, ep = epB)) %>% 
     group_by(team) %>% 
     summarize(ep = sum(ep))

Upvotes: 2

Melissa Key

Reputation: 4551

Here's a tidyverse workflow that works by reformatting the data to a long format. It does keep track of who is in the same game (game_id), and whether they were the A or B team - if that is useful. (in all fairness, this is the same basic idea as @Emil, just a different workflow to achieve it.)

worldcup.long <- worldcup.df %>% 
  as_data_frame() %>%
  mutate(game_id = 1:n()) %>%
  gather(key, value, - game_id) %>%
  mutate(
    AB = str_extract(key, "A|B"),
    key = str_extract(key, "team|ep")
  ) %>%
  spread(key, value,convert = TRUE) 

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))

Upvotes: 3

camille

Reputation: 16862

Your intuition is correct: you do want a data frame with fewer columns and more rows. dplyr::gather will do that; in this case, you can do it in 2 piped gather calls. The first gather creates rows from the teamA_name and teamB_name columns. You can optionally extract the A or B from the entries in that column, to give you just "A" or "B" for each team and its scores. The second gather does the same, but for the epA and epB columns. The key for this gather gives you the same A or B information that you got from a_or_b in the previous gather, so I removed that additional column (select(-pts_a_or_b)).

library(tidyverse)

df_long <- df %>%
  as_tibble() %>%
  gather(key = a_or_b, value = team, teamA_name, teamB_name) %>%
  mutate(a_or_b = str_extract(a_or_b, "(?<=team)\\w")) %>%
  gather(key = pts_a_or_b, value = points, epA, epB) %>%
  select(-pts_a_or_b)

df_long
#> # A tibble: 192 x 3
#>    a_or_b team       points
#>    <chr>  <chr>       <dbl>
#>  1 A      Russia      1.64 
#>  2 A      Egypt       0.705
#>  3 A      Morocco     1.13 
#>  4 A      Portugal    1.11 
#>  5 A      France      2.20 
#>  6 A      Argentina   1.98 
#>  7 A      Peru        1.58 
#>  8 A      Croatia     1.86 
#>  9 A      Costa Rica  1.28 
#> 10 A      Germany     2.09 
#> # ... with 182 more rows

Feel free to correct me if there was more to the summary calculation than just adding up the total points for each team; if I'm understanding what you're looking for, you can get it this way:

df_long %>%
  group_by(team) %>%
  summarise(totPts = sum(points))
#> # A tibble: 32 x 2
#>    team       totPts
#>    <chr>       <dbl>
#>  1 Argentina    8.33
#>  2 Australia    8.32
#>  3 Belgium      8.33
#>  4 Brazil       8.51
#>  5 Colombia     8.31
#>  6 Costa Rica   8.33
#>  7 Croatia      8.23
#>  8 Denmark      8.22
#>  9 Egypt        8.24
#> 10 England      8.34
#> # ... with 22 more rows

Upvotes: 1

Emil

Reputation: 128

Here is a solution that is fewer lines and doesn't require a join:

df2 <- df[,c(2,1,4,3)]
names(df2) <- names(df)
rbind(df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))

# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82

Which is the same as the OP's:

outputdf
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82

Upvotes: 2

In R, group by sports data that has Away and Home teams - a common frustration

Answers (4)

Related Questions