Reputation: 43
I have a data.table containing the x, y, z values of 10,000 points (for this example) in the unit cube, and each point has a corresponding attribute (called P). I've used nn2 from the RANN package to find, within a radius of 0.075 units, the indices of up to k = 50 neighbours of each point in the original data.table (nn2 returns these indices as a matrix).
library(RANN)
library(data.table)

set.seed(1L)  # for reproducible data

# 10,000 random points in the unit cube, each with an attribute P
DATA <- data.table(runif(10000, 0, 1),
                   runif(10000, 0, 1),
                   runif(10000, 0, 1),
                   runif(10000, 10, 30))
colnames(DATA) <- c("x", "y", "z", "P")

# indices of up to k = 50 neighbours within a radius of 0.075 for every point
nn.idx <- nn2(DATA[, 1:3], DATA[, 1:3], k = 50,
              treetype = "kd", searchtype = "radius",
              radius = 0.075)$nn.idx
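Note that with searchtype = "radius", rows of nn.idx where fewer than 50 neighbours fall inside the radius are (as far as I understand nn2's behaviour) padded with 0, and indexing with 0 simply drops those entries, so the means below only run over the neighbours that were actually found. A quick check of the neighbour counts:
# rows of nn.idx are padded with 0 beyond the neighbours actually found;
# counting the non-zero entries gives the neighbour count per point
summary(rowSums(nn.idx > 0))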
The following for loop does the job, but I was wondering if there is any way to speed this up by vectorizing it, as it will not scale when applied to millions of points. Simply put, I want to use nn.idx to look up the corresponding P values in DATA and calculate an average P for each point, which is then assigned to a new column in DATA called mean.P:
# naive loop: one data.table subset and one assignment per point
for (index in 1:nrow(DATA))
  DATA$mean.P[index] <- mean(DATA[nn.idx[index, ], P])
To illustrate, the following code shows what I'm trying to calculate: for a given point (red dot), take the mean P value of all points in a sphere around it (orange + red dots) and assign it to that point (red dot). This has to be repeated for every point (grey dots), but in an efficient way that will scale to big data sets.
library(rgl)
rgl.open()
rgl.points(DATA[1500, 1], DATA[1500, 2], DATA[1500, 3], color = "red")  # the point itself
rgl.points(DATA[nn.idx[1500, ], 1:3], color = "orange", add = T)        # its neighbours
rgl.points(DATA[, 1:3], color = "lightgray", alpha = 0.1, add = T)      # all points
I have never spent so much time trying to efficiently vectorize a single loop in my life! Also, I'm not opposed to punting and just doing it with C++ and Rcpp, but I figured I'd ask here first to see if there is a way in R to make this faster and more scalable. Thanks in advance!
Upvotes: 4
Views: 279
Reputation: 42544
Here is another solution which uses melt() to reshape the index matrix into long format, then joins and aggregates:
# reshape the neighbour index matrix into long format, keeping the point id in pt
long <- melt(as.data.table(nn.idx)[, pt := .I], measure.vars = patterns("V"))
# join P onto the neighbour indices and average by point
tmp <- long[DATA[, pt := .I], on = .(value = pt)][, mean(P), by = .(pt)][order(pt), V1]
# append the result as a new column and drop the helper column
DATA[, mean.P := tmp][, pt := NULL][]
The index matrix nn.idx is converted to a data.table and gains a column pt, which is the row id of the points. Then the matrix is reshaped from wide to long format.
tmp is the vector of mean values of the neighbouring points. These are found by right joining DATA with long, matching the indices of the nearest neighbouring points (in column value) against the point index that was appended to DATA beforehand.
The last step is to append the result as a new column in DATA.
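To see what the reshaping step produces, here is a small made-up example (toy.idx and toy.long are purely illustrative names, not part of the solution above) with 3 points and up to 2 neighbour slots each:
# toy neighbour index matrix: 3 points, up to 2 neighbour slots each (0 = no neighbour found)
toy.idx <- matrix(c(1L, 2L,
                    2L, 3L,
                    3L, 0L), nrow = 3, byrow = TRUE)
toy.long <- melt(as.data.table(toy.idx)[, pt := .I], measure.vars = patterns("V"))
toy.long
# one row per (point, neighbour slot): pt is the point id,
# value is the index of one of its neighbours (or 0 if the slot was empty)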
Alternatively, the intermediate result can be appended using a second join:
long <- melt(as.data.table(nn.idx)[, pt := .I], measure.vars = patterns("V"))
long[DATA[, pt := .I], on = .(value = pt)][, mean(P), by = .(pt)][DATA, on = "pt"]
Upvotes: 0
Reputation: 14667
Here is a solution that gives nearly a 100-fold speed increase. I don't fully understand why the improvement is so great, but perhaps one of the real data.table experts can comment on that.
library(RANN)
library(data.table)

set.seed(1L)  # for reproducible data
DATA <- data.table(runif(10000, 0, 1),
                   runif(10000, 0, 1),
                   runif(10000, 0, 1),
                   runif(10000, 10, 30))
colnames(DATA) <- c("x", "y", "z", "P")

nn.idx <- nn2(DATA[, 1:3], DATA[, 1:3], k = 50,
              treetype = "kd", searchtype = "radius",
              radius = 0.075)$nn.idx
# (1)
# Timing for the original loop.
system.time(for (index in 1:nrow(DATA)) {
  DATA$mean.P[index] <- mean(DATA[nn.idx[index, ], P])
})
#    user  system elapsed
#   7.830   0.850   8.684

# (2)
# Use `set()` instead of `$<-` and `[<-`.
system.time({for (index in 1:nrow(DATA)) {
  set(DATA, i = index, j = "mean_P_2", value = mean(DATA[nn.idx[index, ], P]))
}})
#    user  system elapsed
#   3.405   0.008   3.417
As you can see, there is a 2-fold improvement just from substituting the data.table-specific set() function into the original loop; set() updates the column by reference and avoids the overhead that $<- and [.data.table incur on every iteration.
Next, I've tried to move all of the work into data.table-specific functions (mostly inside the data.table [] syntax). I've also put the P values into a plain vector, because accessing values in a vector is typically much faster than the equivalent operations on data.frames or data.tables.
# (3)
# Add a row index.
DATA[, row_idx := seq(nrow(DATA))]

# Isolate the P values in a vector, because vector access is cheaper
# than data.table or data.frame access.
P_vec = DATA$P

system.time({
  # Create a list column where each element is a vector of 50 integer indexes.
  DATA[, nn_idx := lapply(row_idx, function(i) nn.idx[i, ])]
  # Use `:=` and `by=` to internalize the loop within `[.data.table`.
  DATA[, mean_P_3 := mean(P_vec[nn_idx[[1]]]), by = row_idx]
})
#    user  system elapsed
#   0.092   0.002   0.095
# All results are identical.
all.equal(DATA$mean.P, DATA$mean_P_2)
# [1] TRUE
all.equal(DATA$mean.P, DATA$mean_P_3)
# [1] TRUE
This yields a nearly 100-fold speed increase when compared to the original loop.
It seems to scale pretty well up to 1 million data points:
# Try with 1 million data points.
set.seed(1L)  # for reproducible data
DATA2 <- data.table(runif(1e6, 0, 1),
                    runif(1e6, 0, 1),
                    runif(1e6, 0, 1),
                    runif(1e6, 10, 30))
colnames(DATA2) <- c("x", "y", "z", "P")

system.time({
  nn.idx2 <- nn2(DATA2[, 1:3], DATA2[, 1:3], k = 50,
                 treetype = "kd", searchtype = "radius",
                 radius = 0.075)$nn.idx
})
#    user  system elapsed
# 346.603   1.883 349.708

DATA2[, row_idx := seq(nrow(DATA2))]
P_vec = DATA2$P

system.time({
  DATA2[, nn_idx := lapply(row_idx, function(i) nn.idx2[i, ])]
  DATA2[, mean_P := mean(P_vec[nn_idx[[1]]]), by = row_idx]
})
#    user  system elapsed
#  15.685   0.587  16.297
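As a rough side illustration of why the P values were pulled into a plain vector (this is not part of the benchmark above, and exact timings will vary by machine), repeatedly subsetting the vector is much cheaper than repeatedly subsetting the data.table:
# compare repeated subsetting of the plain vector vs. the data.table column
idx <- nn.idx2[1, ]                                  # neighbour indices of point 1
system.time(for (r in 1:1000) mean(P_vec[idx]))      # plain vector access
system.time(for (r in 1:1000) mean(DATA2[idx, P]))   # a full data.table subset each time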
Timings were done on a single core of a 2011 MacBook Pro (2.2 GHz Sandy Bridge). RAM usage stayed below 1.5 GB.
Upvotes: 2