Reputation: 83
I have longitudinal data with >100 rows per subject representing daily observations. I want to collapse columns, by subject ID, into monthly observations (i.e. have multiple rows per ID that are summarizing every 30 rows (days) of data).
How can you specify such groupings of days using dplyr?
Also of note, all subjects have different total number of days
Edit: data sample below
df<-structure(list(ID = structure(c(100087, 100087, 100087, 100087,
100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087,
100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087)), time = structure(c(0, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)),
BMI = structure(c(20.06, 20.06, 20.06, 20.06, 20.06, 20.06,
20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06,
20.06, 20.06, 20.06, 20.06, 20.06)), Dis = structure(c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)),
Drug1 = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1)), Drug2 = structure(c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1
Views: 115
Reputation: 18561
I would group_by
ID
and a new times
variable where you can specify the bins with time %/% 30
for 30 rows. Since your example data has only a few rows I set this to 5
. Since each respondent has a different amount of times
we need to record first_time
and last_time
to then overwrite times
as x - y
times where x and y are first and last time.
In the across
call you would need to specify the way you want to aggregate the data, below I choose mean
. In case you want to get the mean
of BMI
and the max
value of Drug1
you would need to specify each column in a separate function call.
library(dplyr)
df %>%
group_by(ID, times = time %/% 5) %>%
summarise(across(BMI:Drug2, mean),
time_first = first(time),
time_last = last(time)
) %>%
ungroup() %>%
mutate(times = paste0(time_first, "-", time_last)) %>%
select(-c(time_first, time_last))
#> `summarise()` has grouped output by 'ID'. You can override using the `.groups`
#> argument.
#> # A tibble: 4 × 6
#> ID times BMI Dis Drug1 Drug2
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 100087 0-4 20.1 0 1 1
#> 2 100087 5-9 20.1 0 1 1
#> 3 100087 10-14 20.1 0 1 1
#> 4 100087 15-19 20.1 0 1 1
# OPs data
df <- structure(list(ID = structure(c(100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087, 100087)), time = structure(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)), BMI = structure(c(20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06, 20.06)), Dis = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), Drug1 = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), Drug2 = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
Created on 2022-09-27 by the reprex package (v0.3.0)
Upvotes: 1