R.Hanovre
R.Hanovre

Reputation: 27

Create column with (for each row) maximum value in a subset based on values in that row

I have a data frame df with three columns. Two input columns input1 and input2 and an output column.

I want to create a new column with the maximum value in output within a subset of df which is based on all rows in which input1 and input2 are below or equal to the respective input values in the respective row.

I managed to do that in a for loop easily:

output <- c(1:10)
input1 <- c(5,5,10,10,7,7,20,9,12,18)
input2 <- c(8,6,16,16,8,20,21,12,30,21)

df <- as.data.frame(cbind(output, input1, input2))
  
  
  for (i in 1:nrow(df)){
    df[i,"max"] <- max(df$output[df$input1 <= df$input1[i] &
                                         df$input2 <= df$input2[i]])
  }

However, it is not feasible with my original data with up to 1.000.000 observations.

Is there any option with apply or within data.table to speed up this process?

Upvotes: 0

Views: 85

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

You can use fuzzyjoin::fuzzy_inner_join :

library(dplyr)

fuzzyjoin::fuzzy_inner_join(df, df, 
                            by = c('input1', 'input2'), 
                            match_fun = c(`>=`, `>=`)) %>% 
  group_by(output = output.x, input1 = input1.x, input2 = input2.x) %>%
  summarise(max = max(output.y))

#   output input1 input2   max
#    <dbl>  <dbl>  <dbl> <dbl>
# 1      1      5      8     2
# 2      2      5      6     2
# 3      3     10     16     8
# 4      4     10     16     8
# 5      5      7      8     5
# 6      6      7     20     6
# 7      7     20     21    10
# 8      8      9     12     8
# 9      9     12     30     9
#10     10     18     21    10

Upvotes: 1

Related Questions