self-reference mutate in R

Hellow, I'm trying to build a tibble by a mutate operation that needs a self reference filter.

My objectie is to get the maximum value of a variable for all the available values of that variable that have a datetime greater than the datetime of the calculation row.

The only way I have achieved to handel it is by the following code using a for loop:

for(row in 1:nrow(df)){
  df[row, "result"] <- df %>%
    filter(datetime > df[row, "datetime"]) %>%
    pull(variable) %>%
    max(.)
}

The problem is that it is very slow and takes a long time to run. ¿is there a tydier way to do this, or any more efficient way?

Upvotes: 0

Views: 81

Answers (2)

Jon Spring
Jon Spring

Reputation: 66880

This approach should be pretty efficient. I reverse the data, and by group take the cumulative max, and then take the lag of that. Once back in order, that will show the largest remaining value after the current one, within the current group. Note NA's at the last observation per group, since there are no later observations to take the max of.

set.seed(42)
n = 10
df <- data.frame(datetime = as.POSIXct("2024-08-13 12:00", tz = "UCT") + 
                   runif(n, max = 1E6),
                 variable = sample(1:1E3, n, TRUE),
                 group = sample(letters[1:3], n, TRUE))

df |>
  arrange(group, desc(datetime)) |>
  mutate(future_max = lag(cummax(variable)), .by = group) |>
  arrange(group, datetime)


Result
           
              datetime variable group future_max
1  2024-08-15 01:24:26      532     a        839
2  2024-08-16 19:28:59      839     a        410
3  2024-08-19 12:11:35      165     a        410
4  2024-08-21 02:29:52      410     a        356
5  2024-08-23 02:40:47      356     a         NA
6  2024-08-21 15:51:04      882     b         NA
7  2024-08-20 22:15:45      601     c        622
8  2024-08-22 00:36:28      622     c        303
9  2024-08-24 02:06:46      303     c         24
10 2024-08-24 08:17:55       24     c         NA

It's decently fast, so it's still ~1 sec with n = 10M.

Upvotes: 1

B. Christian Kamgang
B. Christian Kamgang

Reputation: 6529

You can solve your problem as fallow:

# way 1: Using data.table

library(data.table)

setDT(df, key="variable")  # turn df into data.table and sort it by `variable`
df[, result := last(df$variable[df$datetime > datetime]), by=.(datetime)]

# or df[, result := df[datetime>d2, last(variable)], by=.(d2=datetime)]



# way 2: Using dplyr

library(dplyr)

df = arrange(df, variable)
df |> 
  group_by(datetime) |> 
  mutate(result = last(df$variable[df$datetime > first(datetime)])) |> 
  ungroup()

Upvotes: 0

Related Questions