Andre Sandor

Reputation: 43

Mean of data.table column values as specified using a matrix

I have a data.table containing the x, y, z coordinates of 10,000 points (for this example) in the unit cube, where each point has a corresponding attribute called P. I've used nn2 from the RANN package to find, for each point, the indices of its k nearest neighbours (up to 50) within a radius of 0.075 units; nn2 returns these indices as a matrix.

library(RANN)
library(data.table)

set.seed(1L) # for reproducible data
DATA <- data.table(runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 10,30))
colnames(DATA) <- c("x", "y", "z", "P")
nn.idx <- nn2(DATA[,1:3], DATA[,1:3], k=50, 
              treetype = "kd", searchtype = "radius", 
              radius = 0.075)$nn.idx
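
One subtlety worth noting about nn2: with searchtype = "radius", it pads each row of nn.idx with 0 wherever fewer than k points fall inside the radius. Both base R and data.table silently drop 0 indices when subsetting rows, which is why the code below still averages only the real neighbours. A quick sanity check:

sum(nn.idx == 0)            # total number of padded (empty) slots
range(rowSums(nn.idx > 0))  # min/max number of in-radius neighbours per point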

The following for loop does the job, but I was wondering whether there is a way to speed it up by vectorizing it, as this will not scale when applied to millions of points. Simply put, I want to use nn.idx to look up the corresponding P values in DATA and calculate an average P, which is then assigned to a new column in DATA called mean.P:

for (index in 1:nrow(DATA))
  DATA$mean.P[index] <- mean(DATA[nn.idx[index, ], P])
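
For reference, here is a sketch of one fully vectorized formulation using matrix indexing and rowMeans (mean.P.vec is a placeholder column name; the NA masking handles the 0-padding noted above):

idx <- nn.idx
idx[idx == 0] <- NA                             # mask the 0-padded slots
P_mat <- matrix(DATA$P[idx], nrow = nrow(idx))  # P values of all neighbours, row-aligned
DATA[, mean.P.vec := rowMeans(P_mat, na.rm = TRUE)]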

For illustration, the following code shows what I'm trying to calculate: for a given point (red dot), take the mean P of all points within a sphere around it (orange + red dots) and assign it to that point. Iterate over all points (grey dots), but do it in an efficient way that will scale to big data sets.

library(rgl)
rgl.open()
rgl.points(DATA[1500, 1:3], color = "red")                 # the query point
rgl.points(DATA[nn.idx[1500, ], 1:3], color = "orange")    # its in-radius neighbours
rgl.points(DATA[, 1:3], color = "lightgray", alpha = 0.1)  # all points

[3D scatter plot: the query point in red, its in-radius neighbours in orange, all other points in light grey]

I have never spent so much time trying to efficiently vectorize a single loop in my life! I'm also not opposed to punting and just doing it with C++ and Rcpp, but I figured I'd ask here first to see whether there is a way in R to make it faster and more scalable. Thanks in advance!

Upvotes: 4

Views: 279

Answers (2)

Uwe

Reputation: 42544

Here is another solution using melt() to reshape the index matrix into long format, followed by a join and an aggregation:

# reshape the index matrix to long format, keeping the query point id in pt
long <- melt(as.data.table(nn.idx)[, pt := .I], measure.vars = patterns("V"))
# join neighbour indices to DATA, average P per query point, restore row order
tmp <- long[DATA[, pt := .I], on = .(value = pt)][, mean(P), by = .(pt)][order(pt), V1]
# append the result and drop the helper column
DATA[, mean.P := tmp][, pt := NULL][]

Explanation

The index matrix nn.idx is converted to a data.table and gains a column pt, which holds the row id of each query point. Then the matrix is reshaped from wide to long format.

tmp is the vector of mean values of the neighbouring points. These are found by right-joining DATA with long, matching the neighbour indices (in column value) against the point index pt appended to DATA beforehand.

The last step is to append the result as a new column in DATA.
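
For what it's worth, the same vector of means can be obtained without the join, by indexing P directly inside the aggregation; a sketch (the value > 0 filter guards against the 0-padding nn2 uses when fewer than k neighbours fall inside the radius):

P_vec <- DATA$P
tmp2 <- long[, mean(P_vec[value[value > 0]]), by = pt][order(pt), V1]
all.equal(tmp, tmp2)  # expected TRUE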

Variant 2

Alternatively, the intermediate result can be appended using a second join:

long <- melt(as.data.table(nn.idx)[, pt := .I], measure.vars = patterns("V"))
long[DATA[, pt := .I], on = .(value = pt)][, mean(P), by = .(pt)][DATA, on = "pt"]
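
A cosmetic variation names the aggregate inside j, so the result does not come back as V1 (a sketch, assuming Variant 2 is run on a fresh DATA):

long[DATA[, pt := .I], on = .(value = pt)][
    , .(mean.P = mean(P)), by = .(pt)][DATA, on = "pt"]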

Upvotes: 0

bdemarest

Reputation: 14667

Here is a solution that gives nearly a 100-fold speed increase. I don't fully understand why the improvement is so great, but perhaps one of the real data.table experts can comment on that.

library(RANN)
library(data.table)

set.seed(1L) # for reproducible data
DATA <- data.table(runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 0,1), 
                   runif(10000, 10,30))
colnames(DATA) <- c("x", "y", "z", "P")
nn.idx <- nn2(DATA[,1:3], DATA[,1:3], k=50, 
              treetype = "kd", searchtype = "radius", 
              radius = 0.075)$nn.idx

# (1)
# Timing for original loop.
system.time(for(index in 1:nrow(DATA)) {
    DATA$mean.P[index] <- mean(DATA[nn.idx[index,], P])
})
#    user  system elapsed 
#   7.830   0.850   8.684 

# (2)
# Use `set()` instead of `$<-` and `[<-`.
system.time({for(index in 1:nrow(DATA)) {
    set(DATA, i=index, j="mean_P_2", value=mean(DATA[nn.idx[index, ], P]))
}})
#    user  system elapsed 
#   3.405   0.008   3.417 

As you can see, there is a 2-fold improvement just from substituting the data.table-specific set() function for `$<-` in the original loop: set() updates the table by reference with minimal per-call overhead, whereas `$<-` dispatches through the full replacement machinery on every iteration.

Next, I tried to move all of the functionality into data.table-specific operations (mostly inside the data.table [] syntax). I also put the P values into a separate vector, because accessing values in a vector is much cheaper than the equivalent operations on data.frames or data.tables.

# (3)
# Add row index.
DATA[, row_idx:=seq(nrow(DATA))]

# Isolate P values in a vector, because vector access is cheaper
# than data.table or data.frame access.
P_vec = DATA$P

system.time({
    # Create a list column where each element is a vector of 50 integer indexes.
    DATA[, nn_idx:=lapply(row_idx, function(i) nn.idx[i, ])]
    # Use `:=` and `by=` to internalize the loop within `[.data.table`.
    DATA[, mean_P_3:=mean(P_vec[nn_idx[[1]]]), by=row_idx]
})
#    user  system elapsed 
#   0.092   0.002   0.095 

# All results are identical.
all.equal(DATA$mean.P, DATA$mean_P_2)
# [1] TRUE
all.equal(DATA$mean.P, DATA$mean_P_3)
# [1] TRUE

This yields a nearly 100-fold speed increase when compared to the original loop.
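
Roughly the same speed should be achievable without materializing the list column, by indexing nn.idx directly inside a single vapply; a sketch under the same setup (mean_P_4 is a placeholder name, and this variant was not benchmarked here):

system.time(
    DATA[, mean_P_4 := vapply(row_idx,
                              function(i) mean(P_vec[nn.idx[i, ]]),
                              numeric(1))]
)
all.equal(DATA$mean.P, DATA$mean_P_4)  # expected TRUE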

It seems to scale pretty well up to 1 million data points:

# Try with 1 million data points.
set.seed(1L) # for reproducible data
DATA2 <- data.table(runif(1e6, 0,1), 
                    runif(1e6, 0,1), 
                    runif(1e6, 0,1), 
                    runif(1e6, 10,30))
colnames(DATA2) <- c("x","y","z","P")

system.time({
    nn.idx2 <- nn2(DATA2[,1:3], DATA2[,1:3], k=50, 
                   treetype = "kd", searchtype = "radius", 
                   radius = 0.075)$nn.idx
})
#    user  system elapsed 
# 346.603   1.883 349.708 


DATA2[, row_idx:=seq(nrow(DATA2))]
P_vec = DATA2$P

system.time({
    DATA2[, nn_idx:=lapply(row_idx, function(i) nn.idx2[i, ])]
    DATA2[, mean_P:=mean(P_vec[nn_idx[[1]]]), by=row_idx]
})
#    user  system elapsed 
#  15.685   0.587  16.297 

Timings were done on a single core of a 2011 MacBook Pro (Sandy Bridge, 2.2 GHz). RAM usage stayed below 1.5 GB.

Upvotes: 2
