Reputation: 637
With generic data:
set.seed(456)
a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)
c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
d[i] <- 0.5*(10*c[i])^2+5
}
df2 <- data.frame(c,d)
For each df1$b we want to find the nearest df2$d. Then we create a new variable df1$XYZ that takes the df2$c value of the nearest df2$d.
This question has guided me towards the data.table library, but I am not sure if dplyr and group_by can also be used. Here was my data.table attempt:
library(data.table)
dt1 <- data.table( df1 , key = "b" )
dt2 <- data.table( df2 , key = "d" )
dt2[ dt1 , list( d ) , roll = "nearest" ]
Upvotes: 3
Views: 422
Reputation: 84
Here's a fast and efficient dplyr solution for this problem:
set.seed(456)
a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)
c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
d[i] <- 0.5*(10*c[i])^2+5
}
df2 <- data.frame(c,d)
df1 |>
dplyr::left_join(
dplyr::tibble(
nearest_d_for_each_b = df2$d,
c = df2$c
),
by = dplyr::join_by(closest(b >= nearest_d_for_each_b))
) |>
dplyr::rename(XYZ = c)
#> a b nearest_d_for_each_b XYZ
#> 1 0 20.671423 20.125 0.55
#> 2 0 12.685725 12.605 0.39
#> 3 0 13.358080 13.000 0.40
#> 4 1 22.422697 22.405 0.59
#> 5 0 9.553110 9.500 0.30
#> 6 1 12.356029 12.220 0.38
#> 7 0 12.031036 11.845 0.37
#> 8 0 5.005422 5.005 0.01
#> 9 1 16.480766 16.045 0.47
#> 10 0 15.853126 15.580 0.46
#> 11 1 24.078262 23.605 0.61
#> 12 0 11.696984 11.480 0.36
#> 13 0 14.298740 14.245 0.43
#> 14 1 12.880104 12.605 0.39
#> 15 0 14.806321 14.680 0.44
#> 16 0 14.855290 14.680 0.44
#> 17 0 16.965187 16.520 0.48
#> 18 1 13.751930 13.405 0.41
#> 19 1 15.417251 15.125 0.45
#> 20 1 25.394373 24.845 0.63
#> 21 0 15.604259 15.580 0.46
#> 22 0 15.590747 15.580 0.46
#> 23 1 18.850271 18.520 0.52
#> 24 1 9.122988 8.920 0.28
#> 25 0 17.045193 17.005 0.49
#> 26 0 11.675246 11.480 0.36
#> 27 0 13.717376 13.405 0.41
#> 28 1 18.393911 18.005 0.51
#> 29 1 19.484223 19.045 0.53
#> 30 0 18.091782 18.005 0.51
#> 31 0 18.657269 18.520 0.52
#> 32 1 12.934127 12.605 0.39
#> 33 1 22.789066 22.405 0.59
#> 34 1 17.708494 17.500 0.50
#> 35 0 17.885752 17.500 0.50
#> 36 1 3.726452 NA NA
#> 37 1 10.089700 9.805 0.31
#> 38 1 13.997160 13.820 0.42
#> 39 0 19.479651 19.045 0.53
#> 40 0 14.782275 14.680 0.44
#> 41 1 23.236540 23.000 0.60
#> 42 1 22.816497 22.405 0.59
#> 43 1 7.843108 7.645 0.23
#> 44 0 12.277029 12.220 0.38
#> 45 0 11.143732 11.125 0.35
#> 46 1 14.150060 13.820 0.42
#> 47 1 14.796191 14.680 0.44
#> 48 0 6.397299 6.280 0.16
#> 49 1 22.220531 21.820 0.58
#> 50 1 14.899927 14.680 0.44
Created on 2024-11-23 with reprex v2.1.1
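One caveat: join_by(closest(b >= nearest_d_for_each_b)) matches the largest d that does not exceed each b, not the absolute nearest, which is why row 36 (b ≈ 3.73, below the smallest d) comes back NA. A sketch of a true absolute-nearest variant, using cross_join() and slice_min() with per-operation grouping (both require dplyr >= 1.1.0; the helper column `row` is made up here):

```r
library(dplyr)

# Reconstruct the question's data (cc avoids masking base::c):
set.seed(456)
a <- sample(0:1, 50, replace = TRUE)
b <- rnorm(50, 15, 5)
df1 <- data.frame(a, b)
cc <- seq(0.01, 0.99, 0.01)
df2 <- data.frame(c = cc, d = 0.5 * (10 * cc)^2 + 5)

# Cross-join each b against every (c, d) pair, then keep the row with
# the smallest |b - d| per observation: true absolute-nearest matching.
res <- df1 |>
  mutate(row = row_number()) |>
  cross_join(df2) |>
  slice_min(abs(b - d), n = 1, by = row) |>
  arrange(row) |>
  select(a, b, XYZ = c)
```

With this version row 36 gets XYZ = 0.01 (its b is closest to the smallest d) instead of NA, and row 1 snaps up to 0.56 rather than down to 0.55.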
Upvotes: 0
Reputation: 118779
Here's one way with data.table:
require(data.table)
setDT(df1)[, XYZ := setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]]
You need to get the df2$c corresponding to the nearest value in df2$d for every df1$b. So we need to join as df2[df1], which results in nrow(df1) rows. That can be done with setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]. It returns the result you require. All we need to do is add this back to df1 under the name XYZ. We do that using :=.
The thought process in constructing the rolling join is something like this (assuming df1 and df2 are both data.tables):
We need to get some value(s) for each row of df1. That means i = df1 in the x[i] syntax.
df2[df1]
We need to join df2$d with df1$b. Using on=, that'd be:
df2[df1, on=c(d="b")]
We need just the c column. Use j to select just that column.
df2[df1, c, on=c(d="b")]
We don't need an equi-join but a roll-to-nearest join.
df2[df1, c, on=c(d="b"), roll="nearest"]
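As a sanity check, the same nearest match can be computed in base R with which.min (a quadratic sketch, fine at this size):

```r
# Reconstruct the question's data (cc avoids masking base::c):
set.seed(456)
a <- sample(0:1, 50, replace = TRUE)
b <- rnorm(50, 15, 5)
df1 <- data.frame(a, b)
cc <- seq(0.01, 0.99, 0.01)
df2 <- data.frame(c = cc, d = 0.5 * (10 * cc)^2 + 5)

# For each b, find the index of the closest d, then read off its c:
idx <- vapply(df1$b, function(x) which.min(abs(df2$d - x)), integer(1))
df1$XYZ <- df2$c[idx]
```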
Hope this helps.
Upvotes: 4