Reputation: 8513
I have a rather large dataset with both long and short data inside: some columns have a single value per subject and visit, while others have multiple values.
The short data is duplicated to match the format of the long data, so for some arbitrary number of columns the rows are duplicated.
I'm looking for a way to automatically identify these columns and then shorten the dataset.
I already have a working solution, but the code is very slow.
Is there a way to make it faster?
Ideally, the solution should also use the tidyverse and be contained in a single pipeline.
Here is a reprex:
library(tidyverse)
#overly complicated data generation to match mine
n_id = 200
n_visit = 15
r = 20
N = n_id*n_visit*r
df = tibble(
  id = rep(1:n_id, each = n_visit*r),
  visit = rep(paste0("v", 1:n_visit), times = n_id*r),
) %>%
  mutate(
    !!!set_names(rep(NA, 50), paste0("short", 1:50)),
    !!!set_names(rep(NA, 50), paste0("long", 1:50)),
    across(starts_with("short"), ~visit),      # "short" columns: one value per id/visit
    across(starts_with("long"), ~rnorm(n())),  # "long" columns: vary within id/visit
  ) %>%
  arrange(id, visit)
df %>% summarise(a=length(unique(short1)), .by=c(id, visit)) %>% pull() %>% unique()
#> [1] 1
df %>% summarise(a=length(unique(long1)), .by=c(id, visit)) %>% pull() %>% unique()
#> [1] 20
# expected output: keep id, visit and the columns with a single value per id/visit group
a = df %>%
  summarise(across(everything(), ~length(unique(.x))), .by = c(id, visit)) %>%
  select(id, visit, where(~all(.x == 1))) %>%
  names()
df %>%
  select(all_of(a)) %>%
  distinct()
#> # A tibble: 3,000 × 52
#> id visit short1 short2 short3 short4 short5 short6 short7 short8 short9
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1
#> 2 1 v10 v10 v10 v10 v10 v10 v10 v10 v10 v10
#> 3 1 v11 v11 v11 v11 v11 v11 v11 v11 v11 v11
#> 4 1 v12 v12 v12 v12 v12 v12 v12 v12 v12 v12
#> 5 1 v13 v13 v13 v13 v13 v13 v13 v13 v13 v13
#> 6 1 v14 v14 v14 v14 v14 v14 v14 v14 v14 v14
#> 7 1 v15 v15 v15 v15 v15 v15 v15 v15 v15 v15
#> 8 1 v2 v2 v2 v2 v2 v2 v2 v2 v2 v2
#> 9 1 v3 v3 v3 v3 v3 v3 v3 v3 v3 v3
#> 10 1 v4 v4 v4 v4 v4 v4 v4 v4 v4 v4
#> # ℹ 2,990 more rows
#> # ℹ 41 more variables: short10 <chr>, short11 <chr>, short12 <chr>,
#> # short13 <chr>, short14 <chr>, short15 <chr>, short16 <chr>, short17 <chr>,
#> # short18 <chr>, short19 <chr>, short20 <chr>, short21 <chr>, short22 <chr>,
#> # short23 <chr>, short24 <chr>, short25 <chr>, short26 <chr>, short27 <chr>,
#> # short28 <chr>, short29 <chr>, short30 <chr>, short31 <chr>, short32 <chr>,
#> # short33 <chr>, short34 <chr>, short35 <chr>, short36 <chr>, …
Created on 2024-07-18 with reprex v2.1.0
Upvotes: 0
Views: 60
Reputation: 389135
Although "fast" is subjective, I really doubt dplyr
is the right tool when speed is the goal.
I have recently started exploring the collapse
package, which is faster in many cases. Its syntax is very similar to dplyr's,
so it is easy to get started.
library(collapse)

temp_df <- df |>
  fgroup_by(id, visit) |>
  fsummarise(across(short1:long50, \(x) fnunique(x)))
# keep id, visit and the columns whose per-group distinct count is always 1
get_vars(df, c(TRUE, TRUE, dapply(temp_df[-c(1:2)], \(x) all(x == 1)))) |>
  funique()
# A tibble: 3,000 × 52
# id visit short1 short2 short3 short4 short5 short6 short7 short8 short9
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 v1 v1 v1 v1 v1 v1 v1 v1 v1 v1
# 2 1 v10 v10 v10 v10 v10 v10 v10 v10 v10 v10
# 3 1 v11 v11 v11 v11 v11 v11 v11 v11 v11 v11
# 4 1 v12 v12 v12 v12 v12 v12 v12 v12 v12 v12
# 5 1 v13 v13 v13 v13 v13 v13 v13 v13 v13 v13
# 6 1 v14 v14 v14 v14 v14 v14 v14 v14 v14 v14
# 7 1 v15 v15 v15 v15 v15 v15 v15 v15 v15 v15
# 8 1 v2 v2 v2 v2 v2 v2 v2 v2 v2 v2
# 9 1 v3 v3 v3 v3 v3 v3 v3 v3 v3 v3
#10 1 v4 v4 v4 v4 v4 v4 v4 v4 v4 v4
# ℹ 2,990 more rows
# ℹ 41 more variables: short10 <chr>, short11 <chr>, short12 <chr>,
# short13 <chr>, short14 <chr>, short15 <chr>, short16 <chr>, short17 <chr>,
# short18 <chr>, short19 <chr>, short20 <chr>, short21 <chr>, short22 <chr>,
# short23 <chr>, short24 <chr>, short25 <chr>, short26 <chr>, short27 <chr>,
# short28 <chr>, short29 <chr>, short30 <chr>, short31 <chr>, short32 <chr>,
# short33 <chr>, short34 <chr>, short35 <chr>, short36 <chr>, …
# ℹ Use `print(n = ...)` to see more rows
We can compare the solutions now.
library(collapse)
library(data.table)
library(dplyr)
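# setDT() inside data.table_function() converts its input by reference, so benchmark it on a copy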
df_copy <- copy(df)
collapse_function <- function(df) {
  temp_df <- df |>
    fgroup_by(id, visit) |>
    fsummarise(across(short1:long50, \(x) fnunique(x)))
  get_vars(df, c(TRUE, TRUE, dapply(temp_df[-c(1:2)], \(x) all(x == 1)))) |>
    funique()
}

dplyr_function <- function(df) {
  a = df %>%
    summarise(across(everything(), ~length(unique(.x))), .by = c(id, visit)) %>%
    select(id, visit, where(~all(.x == 1))) %>%
    names()
  df %>%
    select(all_of(a)) %>%
    distinct()
}

data.table_function <- function(df) {
  u <- uniqueN(setDT(df)[, id:visit])
  a2 <- names(df)[c(TRUE, TRUE, sapply(3:ncol(df), \(i) {i <- c(1:2, i); uniqueN(df[, ..i]) == u}))]
  unique(df[, ..a2])
}
Comparing the solutions, collapse fares well against the dplyr
solution but falls short of data.table
. That could also be because I am still learning the package and don't yet know the best functions for the job (one possible variant is sketched after the benchmark).
microbenchmark::microbenchmark(
  dplyr = dplyr_function(df),
  data.table = data.table_function(df_copy),
  collapse = collapse_function(df),
  times = 10L
)
#Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr 2528.2535 2777.8296 2895.0289 2898.4619 2958.9916 3444.3321 10
# data.table 477.6706 498.5118 538.4991 519.4035 529.3705 686.3118 10
# collapse 589.5543 601.7683 688.7736 616.8076 805.4987 933.2496 10
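As one such variant (a sketch I have not benchmarked, so treat it as an assumption rather than a measured improvement), collapse also offers fndistinct(), which computes grouped distinct counts directly and could save the fsummarise() pass:
# untested sketch: grouped distinct counts via fndistinct(), then the same column filter
g <- GRP(df, by = c("id", "visit"))
counts <- fndistinct(get_vars(df, 3:ncol(df)), g)        # one row of counts per id/visit group
keep <- c(TRUE, TRUE, dapply(counts, \(x) all(x == 1)))  # assumes id and visit are columns 1:2
funique(get_vars(df, keep))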
Upvotes: 2
Reputation: 17011
Using data.table::uniqueN
is a bit faster. The idea: a column is "short" exactly when adding it to id and visit does not increase the number of unique rows, so uniqueN() over (id, visit, column) can be compared with the number of id/visit combinations.
library(data.table)
system.time({
  u <- uniqueN(setDT(df)[, id:visit])
  a2 <- names(df)[c(TRUE, TRUE, sapply(3:ncol(df), \(i) {i <- c(1:2, i); uniqueN(df[, ..i]) == u}))]
})
#> user system elapsed
#> 0.61 0.09 0.45
Compare to:
#expected output
system.time(
  a <- df %>%
    summarise(across(everything(), ~length(unique(.x))), .by = c(id, visit)) %>%
    select(id, visit, where(~all(.x == 1))) %>%
    names()
)
#> user system elapsed
#> 1.86 0.12 1.98
identical(a, a2)
#> [1] TRUE
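With the column names in a2, the shortened table itself then follows the same way as in the other answer:
unique(df[, ..a2])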
Upvotes: 2