nicshah
nicshah

Reputation: 345

Aggregating data with missing values in R

I am working to extend the following example based on the question Using a dataframe to input values to a function and aggregating the outputs

I've added a third team (Mary + Bill), but am now having some problems because the "team3" contains a player that doesn't have any data, so the output is just "Mary"

The other issue I have is that the output of "team1" moves from "Mary + Frank" in the input in "teams" to "Frank + Mary" in the output (that is, the order is around the other way).

This makes it difficult to reconcile the output back to the teams.

I can make team1 and team2 work if I use regex and the start and finish of the teams$V2, but that doesn't help with missing data (ie. Bill).

Modified input:

input = structure(list(V1 = c("Team_2022", "Team_2022", "Team_2022"), V2 = c("Frank", "Mary", "John"), V3 = c("Sydney", "Sydney", "Sydney"), V4 = c(55, 76, 14)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))

teams = structure(list(V1 = c("team1", "team2", "team3"), V2 = c("Mary + Frank","Mary + John", "Mary + Bill")), class = "data.frame", row.names = c(NA, -3L))

result = function(data, combination)
{
    purrr::map_df(lapply(combination,function(x){paste(x,collapse = '|')}), function(x){
    df = data[grepl(x,data$V2),] %>% group_by(V3) %>% summarize(V1= paste(V1[1]), V2= paste(V2,collapse = ' + '), V4= sum(V4))
    data = df[,c(2,3,1,4)]
    return(data)
  }
  )
}

apply(teams, 1, function(x) {
  y <- list(str_split(x[2], ' \\+ ', simplify = T))
  result(input, y)
}) %>% do.call("rbind", .)

Output

structure(list(V1 = c("Team_2022", "Team_2022", "Team_2022"), 
    V2 = c("Frank + Mary", "Mary + John", "Mary"), V3 = c("Sydney", 
    "Sydney", "Sydney"), V4 = c(131, 90, 76)), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

Can anyone suggest how I can get to my desired output - either keeps the team1, team2 in the same order so I can join them back afterwards; or joins the team name as part of the input?

output1 = structure(list(V1 = c("Team_2022", "Team_2022", "Team_2022"), 
               V2 = c("Mary + Frank","Mary + John", "Mary + Bill"), V3 = c("Sydney", 
                                                                 "Sydney", "Sydney"), V4 = c(131, 90, 76)), row.names = c(NA, 
                                                                                                                         -3L), class = c("tbl_df", "tbl", "data.frame"))

output2 = structure(list(V1 = c("team1", "team2", "team3"), V2 = c("Team_2022", "Team_2022", "Team_2022"), 
                         V3 = c("Frank + Mary", "John + Mary","Mary + Bill"), V4 = c("Sydney", 
                                                                                     "Sydney", "Sydney"), V5 = c(131, 90, 76)), row.names = c(NA, 
                                                                                                                                             -3L), class = c("tbl_df", "tbl", "data.frame"))

Or even (because I could at least tie the team3 back to missing values):

output3 = structure(list(V1 = c("team1", "team2", "team3"), V2 = c("Team_2022", "Team_2022", "Team_2022"), 
                         V3 = c("Frank + Mary", "John + Mary","Mary"), V4 = c("Sydney",                                                                                    "Sydney", "Sydney"), V5 = c(69, 90, 14)), row.names = c(NA, 
                                                                                                                                             -3L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 0

Views: 31

Answers (1)

Jon Spring
Jon Spring

Reputation: 66415

library(dplyr); library(tidyr)
teams %>%
  separate_rows(V2) %>%
  left_join(input, by = c("V2" = "V2")) %>%
  replace_na(list(V4 = 0)) %>%
  group_by(V1.x) %>% fill(V1.y, V3) %>%
  summarize(V2 = paste(V2, collapse = " + "),
            V3 = first(V3),
            V4 = sum(V4))

Result

# A tibble: 3 × 4
  V1.x  V2           V3        V4
  <chr> <chr>        <chr>  <dbl>
1 team1 Mary + Frank Sydney   131
2 team2 Mary + John  Sydney    90
3 team3 Mary + Bill  Sydney    76

Upvotes: 1

Related Questions