Reputation: 27
I have a data frame df
with three columns. Two input columns input1
and input2
and an output
column.
I want to create a new column with the maximum value in output
within a subset of df
which is based on all rows in which input1
and input2
are below or equal to the respective input values in the respective row.
I managed to do that in a for loop easily:
output <- c(1:10)
input1 <- c(5,5,10,10,7,7,20,9,12,18)
input2 <- c(8,6,16,16,8,20,21,12,30,21)
df <- as.data.frame(cbind(output, input1, input2))
for (i in 1:nrow(df)){
df[i,"max"] <- max(df$output[df$input1 <= df$input1[i] &
df$input2 <= df$input2[i]])
}
However, it is not feasible with my original data with up to 1.000.000 observations.
Is there any option with apply or within data.table to speed up this process?
Upvotes: 0
Views: 85
Reputation: 388817
You can use fuzzyjoin::fuzzy_inner_join
:
library(dplyr)
fuzzyjoin::fuzzy_inner_join(df, df,
by = c('input1', 'input2'),
match_fun = c(`>=`, `>=`)) %>%
group_by(output = output.x, input1 = input1.x, input2 = input2.x) %>%
summarise(max = max(output.y))
# output input1 input2 max
# <dbl> <dbl> <dbl> <dbl>
# 1 1 5 8 2
# 2 2 5 6 2
# 3 3 10 16 8
# 4 4 10 16 8
# 5 5 7 8 5
# 6 6 7 20 6
# 7 7 20 21 10
# 8 8 9 12 8
# 9 9 12 30 9
#10 10 18 21 10
Upvotes: 1