Reputation: 75
I have data grouped by three different variables, plus one measurement variable, as shown below (ID1 = factor, ID2 = integer, ID3 = numeric, varX = numeric):
ID1 ID2 ID3 varX
A 1 0.1 40.0
A 1 0.8 70.5
A 2 0.7 55.0
A 2 0.8 65.0
A 2 1.0 60.0
B 4 0.2 70.0
B 5 0.6 55.7
C 1 0.1 55.0
C 1 0.3 90.0
C 1 0.9 60.0
C 5 0.8 45.5
C 5 0.9 30.0
I want to update each value of varX to be the minimum varX within its ID1/ID2 group, but only considering rows whose ID3 value is greater than or equal to the ID3 of the row being updated.
For example: for ID1 = A, ID2 = 2, ID3 = 0.7, varX would be the minimum of 55.0, 65.0 and 60.0; whereas for ID1 = A, ID2 = 2, ID3 = 0.8, varX would be the minimum of 65.0 and 60.0.
The resulting table would look like:
ID1 ID2 ID3 varX
A 1 0.1 40.0
A 1 0.8 70.5
A 2 0.7 55.0
A 2 0.8 60.0
A 2 1.0 60.0
B 4 0.2 70.0
B 5 0.6 55.7
C 1 0.1 55.0
C 1 0.3 60.0
C 1 0.9 60.0
C 5 0.8 30.0
C 5 0.9 30.0
I have 36,000 rows of data in this format, so performance is relatively important.
Upvotes: 5
Views: 116
Reputation: 66490
Here's a more verbose dplyr approach that might be fast enough (about 1 second to process 1 million rows in your format).
library(dplyr)

df2 <- df %>%
  tibble::rowid_to_column() %>%       # record original row order so we can restore it later
  group_by(ID1, ID2) %>%
  arrange(-ID3) %>%                   # largest ID3 first, so each group is scanned top-down
  mutate(varX2 = cummin(varX)) %>%    # running minimum of varX encountered so far, per group
  ungroup() %>%
  arrange(rowid)                      # put back in original order
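As a quick sanity check, the same pipeline applied to the 12-row example from the question reproduces the expected varX values (the data frame is rebuilt inline here so the snippet runs on its own; the result lands in a new varX2 column):

```r
library(dplyr)

# The 12-row example data from the question
df <- tibble::tibble(
  ID1  = c("A","A","A","A","A","B","B","C","C","C","C","C"),
  ID2  = c(1, 1, 2, 2, 2, 4, 5, 1, 1, 1, 5, 5),
  ID3  = c(0.1, 0.8, 0.7, 0.8, 1.0, 0.2, 0.6, 0.1, 0.3, 0.9, 0.8, 0.9),
  varX = c(40.0, 70.5, 55.0, 65.0, 60.0, 70.0, 55.7, 55.0, 90.0, 60.0, 45.5, 30.0)
)

df2 <- df %>%
  tibble::rowid_to_column() %>%       # remember original row order
  group_by(ID1, ID2) %>%
  arrange(-ID3) %>%                   # largest ID3 first within each group
  mutate(varX2 = cummin(varX)) %>%    # running minimum per group
  ungroup() %>%
  arrange(rowid)                      # restore original order

df2$varX2
# 40.0 70.5 55.0 60.0 60.0 70.0 55.7 55.0 60.0 60.0 30.0 30.0
```

This matches the expected-result table in the question, including the ties (e.g. both A/2/0.8 and A/2/1.0 become 60.0).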
Here's the fake data I tested with (note that dplyr's data_frame() is deprecated; tibble() is the current equivalent):
n <- 1000000
df <- tibble::tibble(
  ID1  = sample(LETTERS[1:26], size = n, replace = TRUE),
  ID2  = sample(1:100, size = n, replace = TRUE),
  ID3  = sample(0.1 * 1:10, size = n, replace = TRUE),
  varX = rnorm(n, 50, 30)
)
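If dplyr still isn't fast enough at your scale, a data.table equivalent (my suggestion, not part of the timing above) expresses the same grouped running minimum as an update by reference; shown here on the 12-row example so it runs standalone:

```r
library(data.table)

# The 12-row example data from the question
dt <- data.table(
  ID1  = c("A","A","A","A","A","B","B","C","C","C","C","C"),
  ID2  = c(1, 1, 2, 2, 2, 4, 5, 1, 1, 1, 5, 5),
  ID3  = c(0.1, 0.8, 0.7, 0.8, 1.0, 0.2, 0.6, 0.1, 0.3, 0.9, 0.8, 0.9),
  varX = c(40.0, 70.5, 55.0, 65.0, 60.0, 70.0, 55.7, 55.0, 90.0, 60.0, 45.5, 30.0)
)

# Visit rows from largest ID3 down within each (ID1, ID2) group and
# assign the running minimum of varX back to the original rows by reference
dt[order(-ID3), varX2 := cummin(varX), by = .(ID1, ID2)]

dt$varX2
# 40.0 70.5 55.0 60.0 60.0 70.0 55.7 55.0 60.0 60.0 30.0 30.0
```

Because := updates in place, the original row order is untouched, so there's no need to record and restore row ids.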
Upvotes: 1