Reputation: 77
I have two large dataframes called intersections (representing intersections of a street system) and users (representing users of a network), as follows:
intersections has three columns: x, y and label_street. They respectively represent the position of an intersection in a square observation window (say [0,5] x [0,5]) and the street it is located on. Here is an example:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
head(intersections)
         x        y label_street
1 0.147674 0.132956            5
2 0.235356 0.150813            6
3 0.095337 0.087345            5
4 0.147674 0.132956            6
Since an intersection lies at the crossing of several streets, every (x,y) combination in the intersections table appears at least twice, each time with a different label_street (e.g. rows 1 and 4 in the previous example). The label_street need not match the row number (which is why it starts at 5 in my example).
users has four columns: x, y, label_street and ID. They respectively represent the position of a user, the street the user is located on and a unique ID per user. There are no duplicates in this dataframe, as each user is located on a single street and has a unique ID. Here is an example (the ID and the label_street need not match the row number):
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), ID = c(2703, 3460, 4325, 12506, 19753, 21282))
head(users)
           x          y label_street    ID
1 0.20428152 0.14448448            6  2703
2 0.17840619 0.13921481            6  3460
3 0.12964668 0.11724543            5  4325
4 0.20423856 0.14447573            6 12506
5 0.19349761 0.14228827            6 19753
6 0.10861251 0.09891443            5 21282
What I want to do is the following: for each point (x,y) of intersections, get the ID of, and the distance to, its closest neighbour in users sharing the same label_street.
I have a working solution using the spatstat function nncross for nearest-neighbour searching and the plyr function adply for working on the data.
My working solution is as follows:
1) Write a user-defined function which gets the ID and the distance to the nearest neighbour of a row in a query table
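library(spatstat)  # provides as.ppp() and nncross()
library(plyr)      # provides adply()

NN <- function(row, query){
  df <- row
  window <- c(0,5,0,5) # Needed to convert to ppp objects and compute NN distance using nncross
  # nearest neighbour of the row's (x,y) among the query's (x,y) points
  nn <- nncross(as.ppp(row[,1:2], window), as.ppp(query[,1:2], window))
  df$NN.ID <- query$ID[nn$which]
  df$NN.dist <- nn$dist # named NN.dist to match the result shown below
  return(df)
}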
2) Apply this user-defined function row-wise to my dataframe intersections, with the query being the subset of users sharing the same label_street as the row:
result <- adply(intersections, 1, function(row) NN(row, users[users$label_street == row$label_street, ]))
The result is as follows on the example:
head(result)
         x        y label_street NN.ID    NN.dist
1 0.147674 0.132956            5  4325 0.02391247
2 0.235356 0.150813            6  2703 0.03171236
3 0.095337 0.087345            5 21282 0.01760940
4 0.147674 0.132956            6  3460 0.03136304
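For small inputs, the expected result can be cross-checked against a brute-force reference. The following is only a minimal sketch in base R, not the spatstat solution above: nearest_same_street is a hypothetical helper name, and it assumes the column names x, y, label_street and ID used in the examples.

```r
# Brute-force reference: for every intersection, scan the users on the same
# street and keep the closest one. Cost grows with (intersections x users)
# per street, so this is meant for validating small inputs only.
nearest_same_street <- function(intersections, users) {
  out <- intersections
  out$NN.ID <- NA_real_
  out$NN.dist <- NA_real_
  for (i in seq_len(nrow(intersections))) {
    # candidate users: those on the same street as intersection i
    cand <- users[users$label_street == intersections$label_street[i], ]
    if (nrow(cand) == 0) next  # no user on this street: leave NA
    d <- sqrt((cand$x - intersections$x[i])^2 + (cand$y - intersections$y[i])^2)
    j <- which.min(d)
    out$NN.ID[i] <- cand$ID[j]
    out$NN.dist[i] <- d[j]
  }
  out
}

intersections <- data.frame(
  x = c(0.147674, 0.235356, 0.095337, 0.147674),
  y = c(0.132956, 0.150813, 0.087345, 0.132956),
  label_street = c(5, 6, 5, 6)
)
users <- data.frame(
  x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251),
  y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443),
  label_street = c(6, 6, 5, 6, 6, 5),
  ID = c(2703, 3460, 4325, 12506, 19753, 21282)
)

reference <- nearest_same_street(intersections, users)
print(reference)
```

Any faster approach should reproduce the NN.ID and NN.dist columns of this reference on the example data.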
Since my real dataframes will be huge, I think computing distance matrices to find the nearest neighbour won't be efficient, and adply will be slow.
Does anyone have an idea for a data.table-like solution? I only know the basics of data.table and have always found it very efficient compared to plyr.
Upvotes: 3
Views: 742
Reputation: 502
This solution uses the RANN package to find nearest neighbours. The trick is to first ensure that elements with different label_street are farther apart than any two elements with the same label_street. We do this by adding an additional numeric column with a very large value that is constant within the same label_street but differs between different values of label_street. In total, you get:
intersections <- data.frame(x=c(0.147674, 0.235356, 0.095337, 0.147674), y=c(0.132956, 0.150813, 0.087345, 0.132956), label_street = c(5,6,5,6))
users <- data.frame(x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251), y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443), label_street = c(6,6,5,6,6,5), number = c(2703, 3460, 4325, 12506, 19753, 21282))
# add a numeric column that is constant within each category and has a very large value
intersections$label_street_large <- intersections$label_street * 1e6
users$label_street_large <- users$label_street * 1e6
# call the nearest neighbour function (k = 1 neighbour);
# nn2() searches `data` (users) for the nearest neighbour of each `query` row (intersections)
nearest_neighbours <- RANN::nn2(
  data = users[, c("x", "y", "label_street_large")],
  query = intersections[, c("x", "y", "label_street_large")],
  k = 1
)
# get original IDs and distances (one entry per intersection)
IDs <- users$number[c(nearest_neighbours$nn.idx)]
distances <- c(nearest_neighbours$nn.dists)
IDs
# [1] 4325 2703 21282 3460
distances
# [1] 0.02391247 0.03171236 0.01760940 0.03136304
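One caveat of the offset trick: if a street has no users at all, the search still returns a user from a different street, at a distance dominated by the offset column. A minimal sketch of a post-check follows; flag_cross_street is a hypothetical helper, and the inputs are toy vectors rather than real nn2 output.

```r
# Hypothetical helper: replace matches that crossed streets with NA.
# query_street: street label of each query point
# data_street, data_id: street label and ID of each candidate point
# nn_idx, nn_dist: index of the nearest candidate and its distance, per query
flag_cross_street <- function(query_street, data_street, data_id, nn_idx, nn_dist) {
  same_street <- data_street[nn_idx] == query_street
  data.frame(
    NN.ID   = ifelse(same_street, data_id[nn_idx], NA),
    NN.dist = ifelse(same_street, nn_dist, NA)
  )
}

# Toy input: the third query sits on street 7, which has no users, so its
# "nearest" user is on street 6 and roughly one offset (1e6) away.
checked <- flag_cross_street(
  query_street = c(5, 6, 7),
  data_street  = c(6, 6, 5),
  data_id      = c(2703, 3460, 4325),
  nn_idx       = c(3, 1, 2),
  nn_dist      = c(0.0239, 0.0317, 1e6)
)
print(checked)
```

Comparing the street labels directly avoids having to pick a distance threshold that depends on the size of the offset.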
I hope this helps. It should be very fast because it only calls nn2 once, which runs in O(N * log(N)) time.
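Since the question asks for a data.table-like approach, a per-street variant might look like the sketch below. It is an illustration under stated assumptions rather than the method above: it assumes the data.table and RANN packages are installed, splits users by street once, and calls nn2 inside each label_street group instead of adding an offset column. The names users_by_street and result are illustrative.

```r
library(data.table)
library(RANN)

intersections <- data.table(
  x = c(0.147674, 0.235356, 0.095337, 0.147674),
  y = c(0.132956, 0.150813, 0.087345, 0.132956),
  label_street = c(5, 6, 5, 6)
)
users <- data.table(
  x = c(0.20428152, 0.17840619, 0.12964668, 0.20423856, 0.19349761, 0.10861251),
  y = c(0.14448448, 0.13921481, 0.11724543, 0.14447573, 0.14228827, 0.09891443),
  label_street = c(6, 6, 5, 6, 6, 5),
  ID = c(2703, 3460, 4325, 12506, 19753, 21282)
)

# One data.table of users per street, named by the street label ("5", "6", ...).
users_by_street <- split(users, by = "label_street")

result <- intersections[, {
  u <- users_by_street[[as.character(.BY$label_street)]]
  if (is.null(u)) {
    # No user on this street: report NA rather than a cross-street match.
    list(x = x, y = y, NN.ID = NA_real_, NN.dist = NA_real_)
  } else {
    # Search this street's users for the nearest neighbour of each intersection.
    nn <- nn2(data = as.matrix(u[, .(x, y)]), query = cbind(x, y), k = 1)
    list(x = x, y = y, NN.ID = u$ID[nn$nn.idx[, 1]], NN.dist = nn$nn.dists[, 1])
  }
}, by = label_street]

print(result)
```

The rows come back grouped by label_street rather than in the original row order, and this variant builds one search tree per street instead of a single tree over all users, so which approach is faster likely depends on how many streets there are and how many points each one holds.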
Upvotes: 2