Subtract vector from matrix based on data.frame efficiently

Question

I have a matrix X, two data frames A and B and to vectors of indices vec_a and vec_b. A and B contain an index variable each, where the values correspond to the values in vec_a and vec_b. Other than that, A and B contain as as many values as there are columns in X:

# original data
X <- matrix(rnorm(200),100,2)

# values to substract in data.frames
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))

# indices, which values to substract (one for each row of X)
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)

What I want to achieve is the following: For each row iii in X get the values value1 and value2 from A and B based on elements iii in the vectors vec_a and vec_b. Then, subtract these values from the corresponding row in X. May sound a bit confusing, but I hope the following solution makes it more clear what the goal is:

# iterate over all rows of X 
for(iii in 1:nrow(X)){

  # get correct values
  X_clean <- A[which(A$index_a == vec_a[iii]),-1] -    # subtract correct A value
             B[which(B$index_b == vec_b[iii]),-1]      # subtract correct B value

  # this intermediate step is necessary, otherwise we substract a data.frame from a matrix
  X_clean <- as.numeric(X_clean)  

  # subtract from X
  X[iii,] = X[iii,] - X_clean

}

Note that we have to convert to numeric in my loop solution, otherwise X loses class matrix as we subtract a data.frame from a matrix. My solution works fine, until you need to do that for many matrices like A and B and for millions of observations. Is there a solution that does not rely on looping over all rows?

EDIT

Thanks, both answers improve the speed of the code massively. I chose the answer by StupidWolf as it was more efficient than using data.table:

Unit: microseconds
        expr        min         lq       mean     median         uq        max neval cld
   datatable   5557.355   5754.931   6052.402   5881.729   5975.386  14154.040   100  b 
 stupid.wolf    818.529   1172.840   1311.784   1187.593   1221.164   4777.743   100 a  
        loop 111748.790 115141.149 116677.528 116109.571 117085.048 156497.999   100   c

StupidWolf · Accepted Answer

You can just match the rows:

set.seed(111)
# original data
X <- matrix(rnorm(200),100,2)

A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))

vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)

newX <- X - as.matrix(A[match(vec_a,A$index_a),-1]-B[match(vec_b,B$index_b),-1])

Then we run your loop:

for(iii in 1:nrow(X)){

  X_clean <- A[which(A$index_a == vec_a[iii]),-1] -    # subtract correct A value
             B[which(B$index_b == vec_b[iii]),-1]      # subtract correct B value

  X_clean <- as.numeric(X_clean)  
  X[iii,] = X[iii,] - X_clean
}

And check the values are equal:

all.equal(c(newX),c(X))
[1] TRUE

Match should be pretty fast, but if it is still too slow, you can just call out the values of A using vec_a, like A[vec_a,] ..

Subtract vector from matrix based on data.frame efficiently

Answers (2)

Related Questions