Tyler Rinker
Tyler Rinker

Reputation: 109844

Minimum distance between elements in two logical vectors

I have two logical vectors x and y and weighted values, z corresponding to each index. For column x values that are TRUE I'd like to find the nearest y column index that is also TRUE. Then grab the sum of z between min{x_i, y_i}. If there are two min{x_i, y_i} then the smaller sum of z is used.

       x     y          z
1  FALSE  TRUE 0.05647057
2  FALSE FALSE 0.09577802
3   TRUE FALSE 0.04150954
4  FALSE FALSE 0.07242995
5  FALSE  TRUE 0.06220041
6  FALSE FALSE 0.01861535
7  FALSE FALSE 0.05056971
8   TRUE FALSE 0.07726933
9  FALSE  TRUE 0.04669694
10  TRUE  TRUE 0.02312497

There are 3 x values that are TRUE so we'll call them {x_1, x_2, x_3}. Here I demonstrate the summing of the minimum indexes between each x_i and it's nearest y_i neighbor. What is the most efficient base R way to accomplish this. I have a method at the end that utilizes 2 lapply telling me it's probably not efficient. I don't have a math background and usually there's some algebraic way to accomplish these sorts of tasks that is vectorized over the brute computational power.

## x_1
sum(z[3:5]) ## This one is smaller so use it
sum(z[1:3])

## x_2
sum(z[8:9])

## x_3
sum(z[10])

c(sum(z[3:5]), sum(z[8:9]), sum(z[10]))
[1] 0.17613990 0.12396627 0.02312497

MWE:

x <- y <- rep(FALSE, 10)
x[c(3, 8, 10)] <- TRUE
y[c(1, 5, 9, 10)] <- TRUE
set.seed(15)
z <- rnorm(10, .5, .25)/10
data.frame(x=x, y=y, z=z)

Here is an approach that is less than optimal:

dat <- data.frame(x=x, y=y, z=z)
sapply(which(dat[, "x"]), function(x) {
    ylocs <- which(dat[, "y"])
    dists <- abs(x - ylocs)
    min.ylocs <- ylocs[min(dists) == dists]
    min(sapply(min.ylocs, function(y, x2 = x) {
        sum(dat[, "z"][x2:y])
    }))
})

## [1] 0.17613990 0.12396627 0.02312497

I'd prefer to keep the solution within base.

Upvotes: 0

Views: 375

Answers (1)

G. Grothendieck
G. Grothendieck

Reputation: 269371

This uses no loops or apply functions. We use na.locf from zoo to move the index of the last TRUE y up giving fwd and the next TRUE y back giving bck. Finally we determine which of the two corresponding sums is greater. This depends on na.locf in the zoo package but at the end we extract the core code from zoo to avoid the dependence:

library(zoo) # na.locf

x <- dat$x
y <- dat$y
z <- dat$z

yy <- ifelse(y, TRUE, NA) * seq_along(y)
fwd <- na.locf(yy, fromLast = FALSE)[x]
bck <- na.locf(yy, fromLast = TRUE)[x]

cs <- cumsum(z)
pmin(cs[x] - cs[fwd] + z[fwd], cs[bck] - cs[x] + z[x])

The last line gives:

[1] 0.17613990 0.12396627 0.02312497

Here is a mini version of na.locf. The library call above could be replaced with this.

# code extracted from zoo package
na.locf <- function(x, fromLast = FALSE) {
   L <- !is.na(x)
   if (fromLast) rev(c(NA, rev(which(L)))[cumsum(rev(L)) + 1])
   else c(NA, which(L))[cumsum(L)+1L]
}

REVISED: some improvements.

Upvotes: 3

Related Questions