Reputation: 597
The code below does exactly what I want. It takes a data frame with multiple grouping constraints (i.e. id and date) and returns one row per group, with different functions applied to different columns: in this case, collapsing values to a character string and taking the per-group maximum of a column.
The issue is scaling this approach to much bigger data frames with 100k rows and many groups. How can I get the same result more efficiently? A tidyverse solution is preferred (data.table solutions, while not preferred, will also be accepted). I want to keep the flexibility to specify which columns go into each function call, as well as to add further summary calls, e.g. a summarise_at() taking the sum of a column (a sketch of what this might look like follows the example output below).
library(tidyverse)
library(lubridate)

data <- tibble(
  id = c(1, 1, 1, 2, 3, 4, 5, 5, 6, 6, 6),
  date = dmy("01/01/2020"),
  var1 = 1:11,
  var2 = 12:22,
  var3 = 1:11
)

data %>%
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
    summarise_at(., vars(var3), list(~ max(., na.rm = TRUE))), # return max in group
    summarise(., count = n(), .groups = "keep") # return count of cases in group
  )} %>%
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove duplicated grouping columns
  as_tibble()
# A tibble: 6 x 6
     id date       var1             var2              var3 count
  <dbl> <date>     <chr>            <chr>            <int> <int>
1     1 2020-01-01 1 AND 2 AND 3    12 AND 13 AND 14     3     3
2     2 2020-01-01 4                15                   4     1
3     3 2020-01-01 5                16                   5     1
4     4 2020-01-01 6                17                   6     1
5     5 2020-01-01 7 AND 8          18 AND 19            8     2
6     6 2020-01-01 9 AND 10 AND 11  20 AND 21 AND 22    11     3
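For concreteness, here is a minimal sketch of the flexibility I mean: an extra per-group sum slotted in alongside the existing calls. The additional summarise_at() line and its var1_sum output column are illustrative, not part of the working code above:

data %>%
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))),
    summarise_at(., vars(var3), list(~ max(., na.rm = TRUE))),
    summarise_at(., vars(var1), list(var1_sum = ~ sum(., na.rm = TRUE))), # hypothetical extra summary
    summarise(., count = n(), .groups = "keep")
  )} %>%
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # now removes three sets of duplicated grouping columns
  as_tibble()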
This related question may also be useful: How can I use summarise_at to apply different functions to different columns?
Upvotes: 1
Views: 311
Reputation: 8844
Here is a tidyverse approach adapted from @MichaelDewar's answer. It is tidier, but I don't think there is any real improvement in efficiency. Besides, a data frame with 100k rows is not that big a deal in my opinion; I think a tidyverse solution is just fine.
library(dplyr)

data %>%
  group_by(id, date) %>%
  summarise(
    across(c(var1, var2), ~ paste(unique(.), collapse = " AND ")),
    across(var3, ~ max(., na.rm = TRUE)),
    count = n(),
    .groups = "keep"
  )
But if you really want to improve the efficiency, perhaps try this data.table solution:
library(data.table)

setDT(data)[, c(
  lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")),
  list(var3 = max(var3, na.rm = TRUE), count = .N)
), by = c("id", "date")]
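The same j expression could also be written with .SDcols, which keeps the collapsed columns in one place if they ever change. An equivalent sketch, not from the original answer:

library(data.table)

setDT(data)[, c(
  lapply(.SD, function(x) paste(unique(x), collapse = " AND ")), # collapse each .SDcols column
  list(var3 = max(var3, na.rm = TRUE), count = .N)
), by = c("id", "date"), .SDcols = c("var1", "var2")]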
Benchmark
Note that setDT() converts its input to a data.table by reference, which is why f3 is benchmarked on a separate copy (data22) below.
set.seed(2020)
data2 <- data[sample.int(nrow(data), 1e5, replace = TRUE), ]
data22 <- data.table::copy(data2)
f1 <-
  . %>%
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
    summarise_at(., vars(var3), list(~ max(., na.rm = TRUE))), # return max in group
    summarise(., count = n(), .groups = "keep") # return count of cases in group
  )} %>%
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove duplicated grouping columns
  as_tibble()
f2 <-
  . %>%
  group_by(id, date) %>%
  summarise(
    across(c(var1, var2), ~ paste(unique(.), collapse = " AND ")),
    across(var3, ~ max(., na.rm = TRUE)),
    count = n(),
    .groups = "keep"
  )
f3 <- function(dt) {
  setDT(dt)[, c(
    lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")),
    list(var3 = max(var3, na.rm = TRUE), count = .N)
  ), by = c("id", "date")]
}
microbenchmark::microbenchmark(f1(data2), f2(data2), f3(data22))
Result
Unit: milliseconds
       expr     min       lq      mean   median       uq     max neval cld
  f1(data2) 19.6730 20.27990 20.841344 20.50850 20.85045 29.2799   100   c
  f2(data2) 13.5455 14.09240 14.705967 14.34585 14.64625 20.5914   100  b
 f3(data22)  6.9186  7.80615  8.598227  8.32035  8.68040 15.8358   100 a
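If you want to keep tidyverse syntax while pushing the work down to data.table, the dtplyr package may also be worth a try. A minimal sketch, not benchmarked here; it assumes dtplyr is installed and that its translation covers these verbs:

library(dtplyr)
library(dplyr)

data %>%
  lazy_dt() %>%                # build a lazy data.table query
  group_by(id, date) %>%
  summarise(
    across(c(var1, var2), ~ paste(unique(.x), collapse = " AND ")),
    var3 = max(var3, na.rm = TRUE),
    count = n()
  ) %>%
  as_tibble()                  # execute the query and collect the result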
Upvotes: 4
Reputation: 3258
Use summarize(across()):
paste_distinct <- function(x) { # "x" rather than "list", to avoid shadowing base::list
  x %>% unique() %>% sort() %>% paste(collapse = " AND ")
}
data %>%
  group_by(id, date) %>%
  summarize(
    across(c(var1, var2), paste_distinct),
    across(var3, ~ max(., na.rm = TRUE)),
    count = n()
  )
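For reference, the helper on its own (a quick illustrative call, not part of the original answer):

paste_distinct(c(3, 1, 3, 2))
#> [1] "1 AND 2 AND 3"

Note the sort(): the collapsed values come out in sorted order, which happens to match the question's expected output here since the input columns are already ordered.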
Upvotes: 2