Andre Sandor

Reputation: 43

Mean of data.table column values as specified using a matrix

I have a data.table containing the x, y, z coordinates of 10,000 points (for this example) in the unit cube, where each point has a corresponding attribute called P. I've used nn2 from the RANN package to find, for each point, the indices of its k nearest neighbours (up to 50) within a radius of 0.075 units; nn2 returns these indices as a matrix.

library(RANN)
library(data.table)

set.seed(1L) # for reproducible data
DATA <- data.table(runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 10,30))
colnames(DATA) <- c("x", "y", "z", "P")
nn.idx <- nn2(DATA[,1:3], DATA[,1:3], k=50, 
              treetype = "kd", searchtype = "radius", 
              radius = 0.075)$nn.idx
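
One subtlety worth noting about nn2: with searchtype = "radius", it pads each row of nn.idx with 0 wherever fewer than k points fall inside the radius. Both base R and data.table silently drop 0 indices when subsetting rows, which is why the code below still averages only the real neighbours. A quick sanity check:

sum(nn.idx == 0)            # total number of padded (empty) slots
range(rowSums(nn.idx > 0))  # min/max number of in-radius neighbours per point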

The following for loop does the job, but I was wondering whether there is a way to speed it up by vectorizing it, as this will not scale when applied to millions of points. Simply put, I want to use nn.idx to look up the corresponding P values in DATA and calculate an average P, which is then assigned to a new column in DATA called mean.P:

for (index in 1:nrow(DATA))
  DATA$mean.P[index] <- mean(DATA[nn.idx[index, ], P])
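
For reference, here is a sketch of one fully vectorized formulation using matrix indexing and rowMeans (mean.P.vec is a placeholder column name; the NA masking handles the 0-padding noted above):

idx <- nn.idx
idx[idx == 0] <- NA                             # mask the 0-padded slots
P_mat <- matrix(DATA$P[idx], nrow = nrow(idx))  # P values of all neighbours, row-aligned
DATA[, mean.P.vec := rowMeans(P_mat, na.rm = TRUE)]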

For illustration, the following code shows what I'm trying to calculate: for a given point (red dot), take the mean P of all points within a sphere around it (orange + red dots) and assign it to that point. Iterate over all points (grey dots), but do it in an efficient way that will scale to big data sets.

library(rgl)
rgl.open()
rgl.points(DATA[1500, 1:3], color = "red")                 # the query point
rgl.points(DATA[nn.idx[1500, ], 1:3], color = "orange")    # its in-radius neighbours
rgl.points(DATA[, 1:3], color = "lightgray", alpha = 0.1)  # all points

[3D scatter plot: the query point in red, its in-radius neighbours in orange, all other points in light grey]

I have never spent so much time trying to efficiently vectorize a single loop in my life! I'm also not opposed to punting and just doing it with C++ and Rcpp, but I figured I'd ask here first to see whether there is a way in R to make it faster and more scalable. Thanks in advance!

Upvotes: 4

Views: 279

Answers (2)

Uwe

Reputation: 42544

Here is another solution using melt() to reshape the index matrix into long format, followed by a join and an aggregation:

# reshape the index matrix to long format, keeping the query point id in pt
long <- melt(as.data.table(nn.idx)[, pt := .I], measure.vars = patterns("V"))
# join neighbour indices to DATA, average P per query point, restore row order
tmp <- long[DATA[, pt := .I], on = .(value = pt)][, mean(P), by = .(pt)][order(pt), V1]
# append the result and drop the helper column
DATA[, mean.P := tmp][, pt := NULL][]

Explanation

The index matrix nn.idx is converted to a data.table and gains a column pt, which holds the row id of each query point. Then the matrix is reshaped from wide to long format.

tmp is the vector of mean values of the neighbouring points. These are found by right-joining DATA with long, matching the neighbour indices (in column value) against the point index pt appended to DATA beforehand.

The last step is to append the result as a new column in DATA.
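
For what it's worth, the same vector of means can be obtained without the join, by indexing P directly inside the aggregation; a sketch (the value > 0 filter guards against the 0-padding nn2 uses when fewer than k neighbours fall inside the radius):

P_vec <- DATA$P
tmp2 <- long[, mean(P_vec[value[value > 0]]), by = pt][order(pt), V1]
all.equal(tmp, tmp2)  # expected TRUE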

Variant 2

Alternatively, the intermediate result can be appended using a second join:

long <- melt(as.data.table(nn.idx)[, pt := .I], measure.vars = patterns("V"))
long[DATA[, pt := .I], on = .(value = pt)][, mean(P), by = .(pt)][DATA, on = "pt"]
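
A cosmetic variation names the aggregate inside j, so the result does not come back as V1 (a sketch, assuming Variant 2 is run on a fresh DATA):

long[DATA[, pt := .I], on = .(value = pt)][
    , .(mean.P = mean(P)), by = .(pt)][DATA, on = "pt"]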

Upvotes: 0

bdemarest

Reputation: 14667

Here is a solution that gives nearly a 100-fold speed increase. I don't fully understand why the improvement is so great, but perhaps one of the real data.table experts can comment on that.

library(RANN)
library(data.table)

set.seed(1L) # for reproducible data
DATA <- data.table(runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 10,30))
colnames(DATA) <- c("x", "y", "z", "P")
nn.idx <- nn2(DATA[,1:3], DATA[,1:3], k=50, 
              treetype = "kd", searchtype = "radius", 
              radius = 0.075)$nn.idx

# (1)
# Timing for original loop.
system.time(for(index in 1:nrow(DATA)) {
    DATA$mean.P[index] <- mean(DATA[nn.idx[index,], P])
})
#    user  system elapsed 
#   7.830   0.850   8.684 

# (2)
# Use `set()` instead of `$<-` and `[<-`.
system.time({for(index in 1:nrow(DATA)) {
    set(DATA, i=index, j="mean_P_2", value=mean(DATA[nn.idx[index, ], P]))
}})
#    user  system elapsed 
#   3.405   0.008   3.417 

As you can see, there is a 2-fold improvement just from substituting the data.table-specific set() function for `$<-` in the original loop: set() updates the table by reference with minimal per-call overhead, whereas `$<-` dispatches through the full replacement machinery on every iteration.

Next, I tried to move all of the functionality into data.table-specific operations (mostly inside the data.table [] syntax). I also put the P values into a separate vector, because accessing values in a vector is much cheaper than the equivalent operations on data.frames or data.tables.

# (3)
# Add row index.
DATA[, row_idx:=seq(nrow(DATA))]

# Isolate P values in a vector, because vector access is cheaper
# than data.table or data.frame access.
P_vec = DATA$P

system.time({
    # Create a list column where each element is a vector of 50 integer indexes.
    DATA[, nn_idx:=lapply(row_idx, function(i) nn.idx[i, ])]
    # Use `:=` and `by=` to internalize the loop within `[.data.table`.
    DATA[, mean_P_3:=mean(P_vec[nn_idx[[1]]]), by=row_idx]
})
#    user  system elapsed 
#   0.092   0.002   0.095 

# All results are identical.
all.equal(DATA$mean.P, DATA$mean_P_2)
# [1] TRUE
all.equal(DATA$mean.P, DATA$mean_P_3)
# [1] TRUE

This yields a nearly 100-fold speed increase when compared to the original loop.
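
Roughly the same speed should be achievable without materializing the list column, by indexing nn.idx directly inside a single vapply; a sketch under the same setup (mean_P_4 is a placeholder name, and this variant was not benchmarked here):

system.time(
    DATA[, mean_P_4 := vapply(row_idx,
                              function(i) mean(P_vec[nn.idx[i, ]]),
                              numeric(1))]
)
all.equal(DATA$mean.P, DATA$mean_P_4)  # expected TRUE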

It seems to scale pretty well up to 1 million data points:

# Try with 1 million data points.
set.seed(1L) # for reproducible data
DATA2 <- data.table(runif(1e6, 0,1), 
                    runif(1e6, 0,1), 
                    runif(1e6, 0,1), 
                    runif(1e6, 10,30))
colnames(DATA2) <- c("x","y","z","P")

system.time({
    nn.idx2 <- nn2(DATA2[,1:3], DATA2[,1:3], k=50, 
                   treetype = "kd", searchtype = "radius", 
                   radius = 0.075)$nn.idx
})
#    user  system elapsed 
# 346.603   1.883 349.708 


DATA2[, row_idx:=seq(nrow(DATA2))]
P_vec = DATA2$P

system.time({
    DATA2[, nn_idx:=lapply(row_idx, function(i) nn.idx2[i, ])]
    DATA2[, mean_P:=mean(P_vec[nn_idx[[1]]]), by=row_idx]
})
#    user  system elapsed 
#  15.685   0.587  16.297 

Timings were done on a single core of a 2011 MacBook Pro (Sandy Bridge, 2.2 GHz). RAM usage stayed below 1.5 GB.

Upvotes: 2
