Reputation: 188
I have a large data frame (150000 rows) with X and Y as coordinates like df1 as follows:
df1 <- data.frame(X = c(7.48, 7.82, 8.15, 8.47, 8.80, 9.20, 9.51, 9.83, 10.13, 10.59, 7.59, 8.06, 8.39, 8.87, 9.26, 9.64, 10.09, 10.48, 10.88, 11.45),
Y = c(49.16, 48.78, 48.40, 48.03, 47.65, 47.24, 46.87, 46.51, 46.15, 45.73, 48.70, 48.18, 47.72, 47.20, 46.71, 46.23, 45.72, 45.24, 44.77, 44.23),
ID = c("B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2", "B_1_2"),
TI = c(191.31, 191.35, 191.39, 191.44, 191.48, 191.52, 191.56, 191.60, 191.64, 191.69, 1349.93, 1349.97, 1350.01, 1350.05, 1350.09, 1350.14, 1350.18, 1350.22, 1350.26, 1350.30))
in ID column, i have some 100-200 unique ID's and in each unique ID, i have 200-300 data points (rows)
i have another data frame like df2 as follows:
df2 <- data.frame(X = c(7.62, 8.25, 8.95, 9.71, 10.23),
Y = c(49.06, 48.30, 47.55, 46.77, 46.25))
now, based on each row in df2 i.e. x1 and y1, I would like to find out nearest XY in df1 with respect to a unique ID shown as:
df3 <-
X1 Y1 ID1 TI1 X2 Y2 ID2 TI2 X3 Y3 ID3 TI3 X4 Y4 ID4 TI4 X5 Y5 ID5 TI5
7.48 49.16 B_1 191.31 8.15 48.40 B_1 191.39 8.80 47.65 B_1 191.48 9.51 46.87 B_1 191.56 10.13 46.15 B_1 191.64
7.59 48.70 B_1_2 1349.93 8.06 48.18 B_1_2 1349.97 8.87 47.20 B_1_2 1350.05 9.26 46.71 B_1_2 1350.09 10.09 45.72 B_1_2 1350.18
i have tried with the following code:
dist12 <- function(row){
dists <- (row[["X"]] - df2$X)^2 + (row[["Y"]]- df2$Y)^2
return(cbind(df2[which.min(dists),], distance = min(dists)))
}
df3 <- cbind(df1, do.call(rbind, lapply(1:nrow(df1), function(x) dist12(df1[x,]))))
the code is finding the minimum distance between rows by calculating the distance between rows in df1 and df2 dataframes and combining both df1 and df2. from this code, it is assigning a single XY of df2 to multiple rows in df1. but a single row (XY) in df2 can be assigned only to one of the rows in a unique ID.
looking for the code to get the expected output (df3) as presented above
thanks in advance
Upvotes: 2
Views: 727
Reputation: 25225
Here is another option using Florian Privé's fast rowwise pairwise Euclidean distance between 2 matrices here:
m2 <- as.matrix(df2)
#https://stackoverflow.com/questions/59679046/speed-challenge-any-faster-method-to-calculate-distance-matrix-between-rows-of/59687204#59687204
method_XXX <- function(A, B) {
outer(rowSums(A^2), rowSums(B^2), '+') - tcrossprod(A, 2 * B)
}
setDT(df1)[, {
#calculate sq Euclidean distance and find min in each row using `max.col(-`
ans <- c(.(ID=rep(ID, nrow(m2))),
.SD[max.col(-method_XXX(m2, as.matrix(.SD[, -"TI"])))])
#create desired output
cnames <- paste0(rep(c("ID", names(.SD)), nrow(m2)),
rep(1L:nrow(m2), each=ncol(.SD)+1L))
setNames(as.list(c(do.call(rbind, ans))), cnames)
}, ID]
output:
ID ID1 X1 Y1 TI1 ID2 X2 Y2 TI2 ID3 X3 Y3 TI3
1: B_1 B_1 7.48 49.16 191.31 B_1 8.15 48.4 191.39 B_1 8.8 47.65 191.48
2: B_1_2 B_1_2 7.59 48.7 1349.93 B_1_2 8.06 48.18 1349.97 B_1_2 8.87 47.2 1350.05
ID4 X4 Y4 TI4 ID5 X5 Y5 TI5
1: B_1 9.51 46.87 191.56 B_1 10.13 46.15 191.64
2: B_1_2 9.26 46.71 1350.09 B_1_2 10.09 45.72 1350.18
Upvotes: 0
Reputation: 10385
Here is a taste with data.table for performance.
library(data.table)
df1=as.data.table(df1)
do.call(cbind,
apply(df2,1,function(i){
df1[,d:=(df1$X-i[1])^2+(df1$Y-i[2])^2]
df1[df1[,.I[d==min(d)],by=ID]$V1]
})
)
X Y ID TI d X Y ID TI d X Y ID TI d
1: 7.48 49.16 B_1 191.31 0.0296 8.15 48.40 B_1 191.39 0.0200 8.80 47.65 B_1 191.48 0.0325
2: 7.59 48.70 B_1_2 1349.93 0.1305 8.06 48.18 B_1_2 1349.97 0.0505 8.87 47.20 B_1_2 1350.05 0.1289
X Y ID TI d X Y ID TI d
1: 9.51 46.87 B_1 191.56 0.0500 10.13 46.15 B_1 191.64 0.0200
2: 9.26 46.71 B_1_2 1350.09 0.2061 10.09 45.72 B_1_2 1350.18 0.3005
Upvotes: 3