user08041991
user08041991

Reputation: 637

Assign a value based on closest neighbour from other data frame

With generic data:

set.seed(456)

a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)

c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
  d[i] <- 0.5*(10*c[i])^2+5
}
df2 <- data.frame(c,d)

For each df1$b we want to find the nearest df2$d. Then we create a new variable df1$XYZ that takes the df2$c value of the nearest df2$d

This question has guided me towards data.table library. But I am not sure if ddplyr and group_by can also be used:

Here was my data.table attempt:

library(data.table)
dt1 <- data.table( df1 , key = "b" )
dt2 <- data.table( df2 , key = "d" )

dt[ ldt , list( d ) , roll = "nearest" ]

Upvotes: 3

Views: 422

Answers (2)

Daniel Vartanian
Daniel Vartanian

Reputation: 84

Here’s a fast and efficient dplyr solution for this problem:

set.seed(456)

a <- sample(0:1,50,replace = T)
b <- rnorm(50,15,5)
df1 <- data.frame(a,b)

c <- seq(0.01,0.99,0.01)
d <- rep(NA, 99)
for (i in 1:99) {
  d[i] <- 0.5*(10*c[i])^2+5
}
df2 <- data.frame(c,d)

df1 |>
  dplyr::left_join(
    dplyr::tibble(
      nearest_d_for_each_b = df2$d,
      c = df2$c
    ),
    by = dplyr::join_by(closest(b >= nearest_d_for_each_b))
  ) |>
  dplyr::rename(XYZ = c)
#>    a         b nearest_d_for_each_b  XYZ
#> 1  0 20.671423               20.125 0.55
#> 2  0 12.685725               12.605 0.39
#> 3  0 13.358080               13.000 0.40
#> 4  1 22.422697               22.405 0.59
#> 5  0  9.553110                9.500 0.30
#> 6  1 12.356029               12.220 0.38
#> 7  0 12.031036               11.845 0.37
#> 8  0  5.005422                5.005 0.01
#> 9  1 16.480766               16.045 0.47
#> 10 0 15.853126               15.580 0.46
#> 11 1 24.078262               23.605 0.61
#> 12 0 11.696984               11.480 0.36
#> 13 0 14.298740               14.245 0.43
#> 14 1 12.880104               12.605 0.39
#> 15 0 14.806321               14.680 0.44
#> 16 0 14.855290               14.680 0.44
#> 17 0 16.965187               16.520 0.48
#> 18 1 13.751930               13.405 0.41
#> 19 1 15.417251               15.125 0.45
#> 20 1 25.394373               24.845 0.63
#> 21 0 15.604259               15.580 0.46
#> 22 0 15.590747               15.580 0.46
#> 23 1 18.850271               18.520 0.52
#> 24 1  9.122988                8.920 0.28
#> 25 0 17.045193               17.005 0.49
#> 26 0 11.675246               11.480 0.36
#> 27 0 13.717376               13.405 0.41
#> 28 1 18.393911               18.005 0.51
#> 29 1 19.484223               19.045 0.53
#> 30 0 18.091782               18.005 0.51
#> 31 0 18.657269               18.520 0.52
#> 32 1 12.934127               12.605 0.39
#> 33 1 22.789066               22.405 0.59
#> 34 1 17.708494               17.500 0.50
#> 35 0 17.885752               17.500 0.50
#> 36 1  3.726452                   NA   NA
#> 37 1 10.089700                9.805 0.31
#> 38 1 13.997160               13.820 0.42
#> 39 0 19.479651               19.045 0.53
#> 40 0 14.782275               14.680 0.44
#> 41 1 23.236540               23.000 0.60
#> 42 1 22.816497               22.405 0.59
#> 43 1  7.843108                7.645 0.23
#> 44 0 12.277029               12.220 0.38
#> 45 0 11.143732               11.125 0.35
#> 46 1 14.150060               13.820 0.42
#> 47 1 14.796191               14.680 0.44
#> 48 0  6.397299                6.280 0.16
#> 49 1 22.220531               21.820 0.58
#> 50 1 14.899927               14.680 0.44

Created on 2024-11-23 with reprex v2.1.1

Upvotes: 0

Arun
Arun

Reputation: 118779

Here's one way with data.table:

require(data.table)
setDT(df1)[, XYZ := setDT(df2)[df1, c, on=c(d="b"), roll="nearest"]]

You need to get df2$c corresponding to the nearest value in df2$d for every df1$b. So, we need to join as df2[df1] which results in nrow(df1) rows.That can be done with setDT(df2)[df1, c, on=c(d="b"), roll="nearest"].

It returns the result you require. All we need to do is to add this back to df1 with the name XYZ. We do that using :=.


The thought process in constructing the rolling join is something like this (assuming df1 and df2 are both data tables):

  1. We need get some value(s) for each row of df1. That means, i = df1 in x[i] syntax.

    df2[df1]
    
  2. We need to join df2$d with df1$b. Using on= that'd be:

    df2[df1, on=c(d="b")]
    
  3. We need just the c column. Use j to select just that column.

    df2[df1, c, on=c(d="b")]
    
  4. We don't need equi-join but roll to nearest join.

    df2[df1, c, on=c(d="b"), roll="nearest"]
    

Hope this helps.

Upvotes: 4

Related Questions