Reputation: 10163
Apologies for the not-particularly-clear title - hoping my example below helps. I am working with some sports data, attempting to compute "lineup statistics" for certain grouping of players in the data. Below is an example of the type of data I'm working with (playerInfo)
, as well as the type of analysis I am attempting to do (groupedInfo)
:
playerInfo = data.frame(
lineup = c(1,2,3,4,5,6),
player1 = c("Bil", "Tom", "Tom", "Nik", "Nik", "Joe"),
player1id = c("e91", "a27", "a27", "b17", "b17", "3b3"),
player2 = c("Nik", "Bil", "Nik", "Joe", "Tom", "Tom"),
player2id = c("b17", "e91", "b17", "3b3", "a27", "a27"),
player3 = c("Joe", "Joe", "Joe", "Tom", "Joe", "Nik"),
player3id = c("3b3", "3b3", "3b3", "a27", "3b3", "b17"),
points = c(6, 8, 3, 12, 36, 2),
stringsAsFactors = FALSE
)
groupedInfo <- playerInfo %>%
dplyr::group_by(player1, player2, player3) %>%
dplyr::summarise(
lineup_ct = n(),
total_pts = sum(points)
)
> groupedInfo
# A tibble: 6 x 5
# Groups: player1, player2 [?]
player1 player2 player3 lineup_ct total_pts
<chr> <chr> <chr> <int> <dbl>
1 Bil Nik Joe 1 6
2 Joe Tom Nik 1 2
3 Nik Joe Tom 1 12
4 Nik Tom Joe 1 36
5 Tom Bil Joe 1 8
6 Tom Nik Joe 1 3
The goal here is to group_by the 3 players in each row, and then compute some summary statistics (in this simple example, count and sum-of-points) for the different groups. Unfortunately, what dplyr::group_by
is missing is the fact that certain groups of players should be the same group of players, if its the same 3 players simply in different columns.
For example, in the dataframe above, rows 3,4,5,6 all have the same 3 players (Nik, Tom, Joe), however because sometimes Nik is player1, and sometimes Nik is player2, etc., the group_by groups them separately.
For clarity, below is an example of the type of results I am seeking to get:
correctPlayerInfo = data.frame(
lineup = c(1,2,3,4,5,6),
player1 = c("Bil", "Bil", "Joe", "Joe", "Joe", "Joe"),
player1id = c("e91", "e91", "3b3", "3b3", "3b3", "3b3"),
player2 = c("Joe", "Joe", "Nik", "Nik", "Nik", "Nik"),
player2id = c("3b3", "3b3", "b17", "b17", "b17", "b17"),
player3 = c("Nik", "Tom", "Tom", "Tom", "Tom", "Tom"),
player3id = c("b17", "a27", "a27", "a27", "a27", "a27"),
points = c(6, 8, 3, 12, 36, 2),
stringsAsFactors = FALSE
)
correctGroupedInfo <- correctPlayerInfo %>%
dplyr::group_by(player1, player2, player3) %>%
dplyr::summarise(
lineup_ct = n(),
total_pts = sum(points)
)
> correctGroupedInfo
# A tibble: 3 x 5
# Groups: player1, player2 [?]
player1 player2 player3 lineup_ct total_pts
<chr> <chr> <chr> <int> <dbl>
1 Bil Joe Nik 1 6
2 Bil Joe Tom 1 8
3 Joe Nik Tom 4 53
In this second example, I have manually sorted the data alphabetically such that player1 < player2 < player3. As a result, when I do the group_by, it accurately groups rows 3-6 into a single grouping.
How can I achieve this programatically? I'm not sure if (a) re-structuring playerInfo into the column-sorted correctPlayerInfo (as I've done above(), or (b) some other approach where group_by automatically identifies that these are the same groups, is best.
I am actively working on this, and will post updates if I can come about to my own solution. Until then, any help with this is greatly appreciated!
Edit: Thus far I've tried something along these lines:
newPlayerInfo <- playerInfo %>%
dplyr::mutate(newPlayer1 = min(player1, player2, player3)) %>%
dplyr::mutate(newPlayer3 = max(player1, player2, player3))
... to no avail.
Upvotes: 0
Views: 760
Reputation: 36
You could create group IDs that are sorted composites of the players' names (or IDs). For example:
playerInfo %>%
mutate(
group_id = purrr::pmap_chr(
.l = list(p1 = player1, p2 = player2, p3 = player3),
.f = function(p1, p2, p3) paste(sort(c(p1, p2, p3)), collapse = "_")
)
) %>%
group_by(group_id) %>%
summarise(
lineup_ct = n(),
total_pts = sum(points)
)
# A tibble: 3 x 3
group_id lineup_ct total_pts
<chr> <int> <dbl>
1 Bil_Joe_Nik 1 6
2 Bil_Joe_Tom 1 8
3 Joe_Nik_Tom 4 53
Upvotes: 2