Canovice
Canovice

Reputation: 10163

Sort values across multiple columns in R with dplyr

Apologies for the not-particularly-clear title - hoping my example below helps. I am working with some sports data, attempting to compute "lineup statistics" for certain grouping of players in the data. Below is an example of the type of data I'm working with (playerInfo), as well as the type of analysis I am attempting to do (groupedInfo):

playerInfo = data.frame(
  lineup = c(1,2,3,4,5,6),
  player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
  player1id = c("e91", "a27", "a27", "b17", "b17", "3b3"),
  player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
  player2id = c("b17", "e91", "b17", "3b3", "a27", "a27"),
  player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
  player3id = c("3b3", "3b3", "3b3", "a27", "3b3", "b17"),
  points = c(6, 8, 3, 12, 36, 2),
  stringsAsFactors = FALSE
)

groupedInfo <- playerInfo %>%
  dplyr::group_by(player1, player2, player3) %>%
  dplyr::summarise(
    lineup_ct = n(),
    total_pts = sum(points)
  )

> groupedInfo
# A tibble: 6 x 5
# Groups:   player1, player2 [?]
  player1 player2 player3 lineup_ct total_pts
  <chr>   <chr>   <chr>       <int>     <dbl>
1 Bil     Nik     Joe             1         6
2 Joe     Tom     Nik             1         2
3 Nik     Joe     Tom             1        12
4 Nik     Tom     Joe             1        36
5 Tom     Bil     Joe             1         8
6 Tom     Nik     Joe             1         3

The goal here is to group_by the 3 players in each row, and then compute some summary statistics (in this simple example, count and sum-of-points) for the different groups. Unfortunately, what dplyr::group_by is missing is the fact that certain groups of players should be the same group of players, if its the same 3 players simply in different columns.

For example, in the dataframe above, rows 3,4,5,6 all have the same 3 players (Nik, Tom, Joe), however because sometimes Nik is player1, and sometimes Nik is player2, etc., the group_by groups them separately.

For clarity, below is an example of the type of results I am seeking to get:

correctPlayerInfo = data.frame(
  lineup = c(1,2,3,4,5,6),
  player1 = c("Bil", "Bil", "Joe", "Joe", "Joe", "Joe"),
  player1id = c("e91", "e91", "3b3", "3b3", "3b3", "3b3"),
  player2 = c("Joe", "Joe", "Nik", "Nik", "Nik", "Nik"),
  player2id = c("3b3", "3b3", "b17", "b17", "b17", "b17"),
  player3 = c("Nik", "Tom", "Tom", "Tom", "Tom", "Tom"),
  player3id = c("b17", "a27", "a27", "a27", "a27", "a27"),
  points = c(6, 8, 3, 12, 36, 2),
  stringsAsFactors = FALSE
)

correctGroupedInfo <- correctPlayerInfo %>%
  dplyr::group_by(player1, player2, player3) %>%
  dplyr::summarise(
    lineup_ct = n(),
    total_pts = sum(points)
  )

> correctGroupedInfo
# A tibble: 3 x 5
# Groups:   player1, player2 [?]
  player1 player2 player3 lineup_ct total_pts
  <chr>   <chr>   <chr>       <int>     <dbl>
1 Bil     Joe     Nik             1         6
2 Bil     Joe     Tom             1         8
3 Joe     Nik     Tom             4        53

In this second example, I have manually sorted the data alphabetically such that player1 < player2 < player3. As a result, when I do the group_by, it accurately groups rows 3-6 into a single grouping.

How can I achieve this programatically? I'm not sure if (a) re-structuring playerInfo into the column-sorted correctPlayerInfo (as I've done above(), or (b) some other approach where group_by automatically identifies that these are the same groups, is best.

I am actively working on this, and will post updates if I can come about to my own solution. Until then, any help with this is greatly appreciated!

Edit: Thus far I've tried something along these lines:

newPlayerInfo <- playerInfo %>%
  dplyr::mutate(newPlayer1 = min(player1, player2, player3)) %>%
  dplyr::mutate(newPlayer3 = max(player1, player2, player3))

... to no avail.

Upvotes: 0

Views: 760

Answers (1)

David Laing
David Laing

Reputation: 36

You could create group IDs that are sorted composites of the players' names (or IDs). For example:

playerInfo %>% 
  mutate(
    group_id = purrr::pmap_chr(
      .l = list(p1 = player1, p2 = player2, p3 = player3),
      .f = function(p1, p2, p3) paste(sort(c(p1, p2, p3)), collapse = "_")
    )
  ) %>% 
  group_by(group_id) %>% 
  summarise(
    lineup_ct = n(),
    total_pts = sum(points)
  )

# A tibble: 3 x 3
  group_id    lineup_ct total_pts
  <chr>           <int>     <dbl>
1 Bil_Joe_Nik         1         6
2 Bil_Joe_Tom         1         8
3 Joe_Nik_Tom         4        53

Upvotes: 2

Related Questions