jrcalabrese
jrcalabrese

Reputation: 2321

R: Average rows together within a group to match a second dataframe

This question is an extension of one of my previous questions. I have two long-form dataframes, small and big, that have the same groups (id) and a combination of numeric and character variables. In big, the number of rows per group is greater compared to the number of rows per group in small. My goal is to average rows together in big so that the number of rows per group matches the number of rows per group in small as closely as possible.

I have created a reprex below, which gets close but not as close as I think is possible. I believe the issue is that in big, each group may need its own sum_ref value (which refers to how many n rows should be averaged together), but I am unsure of how to implement that. Any advice is appreciated.

set.seed(123)
library(tidyverse)

id <- c(rep("101", 10), rep("102", 21), rep("103", 15))
color <- c(rep ("red", 10), rep("blue", 21), rep("green", 15))
time <- c(1:10, 1:21, 1:15) 
V1 <- sample(1:3, 10+21+15, replace=TRUE)
V2 <- sample(1:3, 10+21+15, replace=TRUE)
V3 <- sample(1:3, 10+21+15, replace=TRUE)
small <- data.frame(id,color,time,V1,V2,V3) %>%
  mutate(time = 1:length(V1)) %>%
  select(id, time, everything())

id <- c(rep("101", 32), rep("102", 45), rep("103", 27))
color <- c(rep ("red", 32), rep("blue", 45), rep("green", 27))
time <- c(1:32, 1:45, 1:27) 
V1 <- sample(1:3, 32+45+27, replace=TRUE)
V2 <- sample(1:3, 32+45+27, replace=TRUE)
V3 <- sample(1:3, 32+45+27, replace=TRUE)
big <- data.frame(id,color,time,V1,V2,V3) %>%
  mutate(time = 1:length(V1)) %>%
  select(id, time, everything())
rm(V1,V2,V3,color,id,time)

small_size <- nrow(small)
big_size <- nrow(big)
sum_ref <- big_size/small_size

# `new` should have the same number of rows as `small`
# also for each ID, the number of rows in `small` should equal the number of rows in `new`
new <- big %>% 
  group_by(id, color, new_time = as.integer(gl(n(), sum_ref, n()))) %>% 
  summarise(across(starts_with('V'), mean), .groups = 'drop')

print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 53

# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 10
new %>% filter(id == "101") %>% nrow()
#> [1] 16

Upvotes: 1

Views: 69

Answers (1)

Ricardo Semi&#227;o
Ricardo Semi&#227;o

Reputation: 4456

You are correct: "each group may need its own sum_ref value". My solution to that is creating a variable to storage the sizes of each group in small:

small_size <- small %>% group_by(id, color) %>% summarise(size = n())

Then, for each group in big, we create the column of which values should be averaged together. In your code, you did that using as.integer(gl(n(), sum_ref, n())), but as sum_ref is a decimal number, that doesn't assures that this column will go from 1 to the size of the corresponding small group size, so I made a new version:

seq(1, small_size$size[cur_group_id()], length = n()) %>% trunc()

This makes a sequence that goes from 1, to the small group size stored in small_size, using the cur_group_id() to acess the correct entry of the array. This sequence will have size n() (the big group size) and will only have intergers because of %>% trunc() (which does the same as your as.interger()). There might be a better way to do this, as with my method the last value only apears once. But regardles of the way you choose to make this vector transformation, the essense of the answer is how to make a diferent transformation for each group with small_size$size[cur_group_id()].

new <- big %>% 
  group_by(id, color) %>%
  mutate(new_time = seq(1, small_size$size[cur_group_id()], length = n()) %>% trunc()) %>%
  group_by(new_time, .add = TRUE) %>%
  summarise(across(starts_with('V'), mean), .groups = 'drop')

print(nrow(small))
#> [1] 46
print(nrow(new))
#> [1] 46

# for id 101
small %>% filter(id == "101") %>% nrow()
#> [1] 16
new %>% filter(id == "101") %>% nrow()
#> [1] 16

Upvotes: 1

Related Questions