Reputation: 2543
For each point (x,y) in a data frame, I want to calculate the sum of the Euclidean distances from that point to all other points in the data frame that do not share its 'group' label. Here is a hacky for-loop version of what I'm trying to achieve:
# some fake data
d <- data.frame(group=rep(c('a','b','c'),each=3), x=sample(1:9), y=sample(1:9), z=NA)
for (i in 1:nrow(d)) {
  d2 <- subset(d, group != d$group[i])
  d$z[i] <- sum(sqrt((d$x[i] - d2$x)^2 + (d$y[i] - d2$y)^2))
}
For example, the desired value for point a1 should be the sum of distances from a1 to each of b1, b2, b3, c1, c2, c3, but NOT including the distances a1-a2 or a1-a3. Is there a vectorized way to accomplish this? I'm sure it's an obvious solution... I've tried various configurations of by() and apply() but can't seem to hit on the answer.
Upvotes: 1
Views: 127
Reputation: 2543
Benchmark of Backlin's solution against the loop (I enlarged the sample data to 1000 points in 10 groups to amplify the difference):
d <- data.frame(group=rep(letters[1:10],each=100), x=sample(1:1000), y=sample(1:1000), z=NA)
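loopMethod <- function(d) {
  for (i in 1:nrow(d)) {
    d2 <- subset(d, group != d$group[i])
    d$z[i] <- sum(sqrt((d$x[i] - d2$x)^2 + (d$y[i] - d2$y)^2))
  }
  d  # return the updated data frame; the for loop on its own evaluates to NULL
}
backlinMethod <- function(d) {
  # all pairwise distances between the (x, y) columns, computed once
  dists <- as.matrix(dist(d[2:3]))
  d$z <- sapply(seq(d$group), function(i) sum(dists[i, !d$group %in% d$group[i]]))
  d  # return the updated data frame rather than just the assigned vector
}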
system.time(loopMethod(d))
   user  system elapsed
  1.020   0.004   1.021
system.time(backlinMethod(d))
   user  system elapsed
  0.472   0.052   0.525
Upvotes: 1
Reputation: 14842
There is a very nice way to solve this efficiently: precalculate all distances and subset them rather than the points, to avoid repeating the same calculations.
dists <- as.matrix(dist(d[2:3]))
d$z <- sapply(seq(d$group), function(i) sum(dists[i, !d$group %in% d$group[i]]))
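If you want to drop the sapply() as well, here is a rough sketch of the same idea with a mask instead of per-row subsetting. It is not benchmarked, and the name different_group and the set.seed() call are my own additions for illustration. The mask marks every pair of points whose labels differ, so multiplying it into the distance matrix zeroes out the within-group distances before the rows are summed:

```r
# some fake data, as in the question (seed added only for reproducibility)
set.seed(1)
d <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                x = sample(1:9), y = sample(1:9))

# all pairwise Euclidean distances, computed once
dists <- as.matrix(dist(d[c("x", "y")]))

# TRUE where the two points carry different group labels (hypothetical name)
different_group <- outer(d$group, d$group, "!=")

# within-group distances are multiplied by FALSE (0), so they drop out of the sum
d$z <- unname(rowSums(dists * different_group))
```

Like the version above, this holds the full n-by-n distance matrix in memory, so it should suit moderately sized data rather than very large ones.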
Upvotes: 3