Reputation: 2321
This question is an extension of one of my previous questions. I have two long-form dataframes, `small` and `big`, that have the same groups (`id`) and a combination of numeric and character variables. In `big`, the number of rows per group is greater than the number of rows per group in `small`. My goal is to average rows together in `big` so that the number of rows per group matches the number of rows per group in `small` as closely as possible.
I have created a reprex below, which gets close but not as close as I think is possible. I believe the issue is that in `big`, each group may need its own `sum_ref` value (which refers to how many rows should be averaged together), but I am unsure of how to implement that. Any advice is appreciated.
set.seed(123)
library(tidyverse)
id <- c(rep("101", 10), rep("102", 21), rep("103", 15))
color <- c(rep ("red", 10), rep("blue", 21), rep("green", 15))
time <- c(1:10, 1:21, 1:15)
V1 <- sample(1:3, 10+21+15, replace=TRUE)
V2 <- sample(1:3, 10+21+15, replace=TRUE)
V3 <- sample(1:3, 10+21+15, replace=TRUE)
small <- data.frame(id, color, time, V1, V2, V3) %>%
  mutate(time = 1:length(V1)) %>%
  select(id, time, everything())
id <- c(rep("101", 32), rep("102", 45), rep("103", 27))
color <- c(rep ("red", 32), rep("blue", 45), rep("green", 27))
time <- c(1:32, 1:45, 1:27)
V1 <- sample(1:3, 32+45+27, replace=TRUE)
V2 <- sample(1:3, 32+45+27, replace=TRUE)
V3 <- sample(1:3, 32+45+27, replace=TRUE)
big <- data.frame(id, color, time, V1, V2, V3) %>%
  mutate(time = 1:length(V1)) %>%
  select(id, time, everything())
rm(V1,V2,V3,color,id,time)
small_size <- nrow(small)
big_size <- nrow(big)
sum_ref <- big_size/small_size   # 104 / 46, about 2.26
# `new` should have the same number of rows as `small`
# also for each ID, the number of rows in `small` should equal the number of rows in `new`
new <- big %>%
  group_by(id, color, new_time = as.integer(gl(n(), sum_ref, n()))) %>%
  summarise(across(starts_with('V'), mean), .groups = 'drop')
print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 53
# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 10
new %>% filter(id == "101") %>% nrow()
#> [1] 16
Upvotes: 1
Views: 69
Reputation: 4456
You are correct: "each group may need its own `sum_ref` value". My solution is to create a variable that stores the size of each group in `small`:
small_size <- small %>% group_by(id, color) %>% summarise(size = n())
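With the reprex data the group sizes are fixed by the `rep()` calls (10, 21, and 15 rows), so `small_size` should look roughly like this (exact print formatting may vary by dplyr version):
small_size
#> # A tibble: 3 x 3
#> # Groups:   id [3]
#>   id    color  size
#>   <chr> <chr> <int>
#> 1 101   red      10
#> 2 102   blue     21
#> 3 103   green    15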
Then, for each group in `big`, we create the column that says which rows should be averaged together. In your code you did that with `as.integer(gl(n(), sum_ref, n()))`, but because `sum_ref` is not a whole number (104/46, about 2.26, here), `gl()` truncates it and builds blocks of 2 rows. That does not ensure the column runs from 1 to the size of the corresponding `small` group: for id 101 you get 16 blocks instead of 10, and 53 rows overall instead of 46. So I made a new version:
seq(1, small_size$size[cur_group_id()], length.out = n()) %>% trunc()
This makes a sequence that goes from 1 to the small group size stored in `small_size`, using `cur_group_id()` to access the correct entry of that table. The sequence has length `n()` (the `big` group size) and contains only integers because of `%>% trunc()` (which plays the same role as your `as.integer()`). There might be a better way to do this, since with my method the last value only appears once. But regardless of how you choose to make this vector transformation, the essence of the answer is applying a different transformation to each group via `small_size$size[cur_group_id()]`.
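To see what that produces, here is the binning for id 101, where `big` has 32 rows and the `small` group size is 10 (values shown without R's usual `[1]` index prefix):
trunc(seq(1, 10, length.out = 32))
#> 1 1 1 1 2 2 2 3 3 3 3 4 4 4 5 5 5 5 6 6 6 7 7 7 7 8 8 8 9 9 9 10
Every row falls into one of 10 bins, so summarising by this column yields 10 rows for id 101, matching `small`; note that the last bin holds a single row, which is the caveat mentioned above.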
new <- big %>%
  group_by(id, color) %>%
  # cur_group_id() lines up with the rows of small_size because both use the
  # same id/color grouping, which dplyr orders the same way
  mutate(new_time = seq(1, small_size$size[cur_group_id()], length.out = n()) %>% trunc()) %>%
  group_by(new_time, .add = TRUE) %>%
  summarise(across(starts_with('V'), mean), .groups = 'drop')
print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 46
# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 10
new %>% filter(id == "101") %>% nrow()
#> [1] 10
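On the caveat that the last value only appears once: a possible variation (a sketch, not part of the solution above) is to let `dplyr::ntile()` build the bins, which spreads each group's rows across the bins as evenly as possible while keeping the per-id row counts equal to those of `small`:
# Alternative binning sketch: ntile() splits each group's rows into
# small_size$size[cur_group_id()] buckets of (almost) equal size
new2 <- big %>%
  group_by(id, color) %>%
  mutate(new_time = ntile(row_number(), small_size$size[cur_group_id()])) %>%
  group_by(new_time, .add = TRUE) %>%
  summarise(across(starts_with('V'), mean), .groups = 'drop')
For id 101 this averages runs of 3 or 4 rows instead of leaving a single row in the last bin.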
Upvotes: 1