Reputation: 97
I have a dataset (df1) with a column that contains Remaining_points for each owner
Df1:
Id Owner Remaining_points
00001 John 18
00008 Paul 34
00011 Alba 52
00004 Martha 67
And another one with different id’s that contains points Df2
Id Points
00025 17
00076 35
00089 51
00092 68
I need to add to df2 a Owner column with most similar Remaining_points on df1
Desired dataframe:
Id Points Owner
00025 17 John
00076 35 Paul
00089 51 Alba
00092 68 Martha
Please, could anyone help me on this? I’m used to work with dplyr but any solution would be very appreciated.
Upvotes: 1
Views: 317
Reputation: 5287
@tacoman's works well. But I couldn't resist including a dplyr version. The cross join is doing a similar job to @tacoman's outer()
.
df1 <- data.frame(ID_1 = c("00001", "00008", "00011", "00004"),
Owner = c("John", "Paul", "Alba", "Martha"),
Remaining_points = c(18, 34, 52, 67))
df2 <- data.frame(ID_2 = c("00025", "00076", "00089", "00092"),
Points = c(17, 35, 51, 68))
df1 |>
dplyr::full_join(df2, by = character()) |> # This is essentially a cross join b/c no key is used.
dplyr::mutate(
distance = abs(Points - Remaining_points), # Find the difference in all possibilities
) |>
dplyr::group_by(ID_2) |> # Isolate each ID in its own sub-dataset
dplyr::mutate(
rank = dplyr::row_number(distance), # Rank the distances. The closest will be '1'.
) |>
dplyr::filter(rank == 1L) |> # Keep only the closest
dplyr::ungroup() |>
dplyr::select(
ID_2,
Points,
Owner
)
Result:
# A tibble: 4 x 3
ID_2 Points Owner
<chr> <dbl> <chr>
1 00025 17 John
2 00076 35 Paul
3 00089 51 Alba
4 00092 68 Martha
This is the intermediate result (before removing the extra rows and columns):
# A tibble: 16 x 7
# Groups: ID_2 [4]
ID_1 Owner Remaining_points ID_2 Points distance rank
<chr> <chr> <dbl> <chr> <dbl> <dbl> <int>
1 00001 John 18 00025 17 1 1 # <- closest for John
2 00001 John 18 00076 35 17 2
3 00001 John 18 00089 51 33 4
4 00001 John 18 00092 68 50 4
5 00008 Paul 34 00025 17 17 2
6 00008 Paul 34 00076 35 1 1 # <- closest for Paul
7 00008 Paul 34 00089 51 17 3
8 00008 Paul 34 00092 68 34 3
9 00011 Alba 52 00025 17 35 3
10 00011 Alba 52 00076 35 17 3
11 00011 Alba 52 00089 51 1 1 # <- closest for Alba
12 00011 Alba 52 00092 68 16 2
13 00004 Martha 67 00025 17 50 4
14 00004 Martha 67 00076 35 32 4
15 00004 Martha 67 00089 51 16 2
16 00004 Martha 67 00092 68 1 1 # <- closest for Martha
Upvotes: 1
Reputation: 947
df1 <- data.frame(ID = c("00001", "00008", "00011", "00004"),
Owner = c("John", "Paul", "Alba", "Martha"),
Remaining_points = c(18, 34, 52, 67))
df2 <- data.frame(ID = c("00025", "00076", "00089", "00092"),
Points = c(17, 35, 51, 68))
ind <- which(apply(abs(outer(df1$Remaining_points,df2$Points, "-")), 2, function(x) x == min(x)), arr.ind = TRUE)
df2$Owner <- df1$Owner[ind[,1]]
df2
ID Points Owner
1 00025 17 John
2 00076 35 Paul
3 00089 51 Alba
4 00092 68 Martha
Upvotes: 2