Michael 96
Michael 96

Reputation: 75

update numeric variables for each row of a group as the minimum of a subgroup

I have a list of data grouped by 3 different variables, and one measurement variable as shown below. ID1 = factor, ID2 = int, ID3 = number, varX = number

ID1 ID2 ID3  varX
A   1   0.1  40.0
A   1   0.8  70.5
A   2   0.7  55.0
A   2   0.8  65.0
A   2   1.0  60.0
B   4   0.2  70.0
B   5   0.6  55.7
C   1   0.1  55.0
C   1   0.3  90.0
C   1   0.9  60.0
C   5   0.8  45.5
C   5   0.9  30.0

I want to update each value of varX to be the minimum value grouped by ID1 and ID2, but also subgrouped by ID3 where only rows with an ID3 value greater than or equal to the current row to be updated are considered when determining the minimum.

For example: for ID1 = A, ID2 = 2, ID3 = 0.7, varX would be the minimum of 55.0, 65.0 and 60.0. Whereas for ID1 = A, ID2 = 2, ID3 = 0.8, varX would be the minimum of 65.0 and 60.0.

The resulting table would look like:

ID1 ID2 ID3  varX
A   1   0.1  40.0
A   1   0.8  70.5
A   2   0.7  55.0
A   2   0.8  60.0
A   2   1.0  60.0
B   4   0.2  70.0
B   5   0.6  55.7
C   1   0.1  55.0
C   1   0.3  60.0
C   1   0.9  60.0
C   5   0.8  30.0
C   5   0.9  30.0

I have 36,000 rows of data in this format, so performance is relatively important

Upvotes: 5

Views: 116

Answers (1)

Jon Spring
Jon Spring

Reputation: 66490

Here's a more verbose dplyr approach that might be fast enough (1 sec to process 1 million rows in your format).

library(dplyr)
df2 <- df %>%
  tibble::rowid_to_column() %>%   # to use later to put back in original order
  group_by(ID1, ID2) %>%
  arrange(-ID3) %>%   # starting with the largest ID3 within each group and working down...
  mutate(varX2 = cummin(varX)) %>%   # what's the min varX encountered so far?
  ungroup() %>%
  arrange(rowid)   # put back in original order

Here's the fake data I tested with:

n = 1000000
df <- data_frame(
  ID1 = sample(LETTERS[1:26], size = n, replace = T),
  ID2 = sample(1:100, size = n, replace = T),
  ID3 = sample(0.1*1:10, size = n, replace = T),
  varX = rnorm(n, 50, 30))

Upvotes: 1

Related Questions