Reputation: 1641
I have a matrix X
, two data frames A
and B
and to vectors of indices vec_a
and vec_b
. A
and B
contain an index variable each, where the values correspond to the values in vec_a
and vec_b
. Other than that, A
and B
contain as as many values as there are columns in X
:
# original data
X <- matrix(rnorm(200),100,2)
# values to substract in data.frames
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
# indices, which values to substract (one for each row of X)
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
What I want to achieve is the following: For each row iii
in X
get the values value1
and value2
from A
and B
based on elements iii
in the vectors vec_a
and vec_b
. Then, subtract these values from the corresponding row in X
. May sound a bit confusing, but I hope the following solution makes it more clear what the goal is:
# iterate over all rows of X
for(iii in 1:nrow(X)){
# get correct values
X_clean <- A[which(A$index_a == vec_a[iii]),-1] - # subtract correct A value
B[which(B$index_b == vec_b[iii]),-1] # subtract correct B value
# this intermediate step is necessary, otherwise we substract a data.frame from a matrix
X_clean <- as.numeric(X_clean)
# subtract from X
X[iii,] = X[iii,] - X_clean
}
Note that we have to convert to numeric
in my loop solution, otherwise X
loses class matrix
as we subtract a data.frame
from a matrix
. My solution works fine, until you need to do that for many matrices like A
and B
and for millions of observations. Is there a solution that does not rely on looping over all rows?
EDIT
Thanks, both answers improve the speed of the code massively. I chose the answer by StupidWolf as it was more efficient than using data.table
:
Unit: microseconds
expr min lq mean median uq max neval cld
datatable 5557.355 5754.931 6052.402 5881.729 5975.386 14154.040 100 b
stupid.wolf 818.529 1172.840 1311.784 1187.593 1221.164 4777.743 100 a
loop 111748.790 115141.149 116677.528 116109.571 117085.048 156497.999 100 c
Upvotes: 1
Views: 265
Reputation: 24790
This approach uses data.table
for easy joining.
library(data.table)
set.seed(111)
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
setDT(A);setDT(B)
dtX <- as.data.table(cbind(1:nrow(X),X,vec_a,vec_b))
as.matrix(
dtX[A, on = .(vec_a = index_a)][B,
on = .(vec_b = index_b)][order(V1),
.(V2 - (value1 - i.value1), V3 - (value2 - i.value2))]
)
V1 V2
[1,] 0.22746 0.7069
[2,] 1.84340 -0.1258
[3,] -0.70038 1.2494
...
[98,] 2.04666 0.6767
[99,] 0.02451 1.0473
[100,] -2.72553 -0.6595
Hopefully this will be pretty fast for very large matrices.
Upvotes: 1
Reputation: 46898
You can just match the rows:
set.seed(111)
# original data
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
newX <- X - as.matrix(A[match(vec_a,A$index_a),-1]-B[match(vec_b,B$index_b),-1])
Then we run your loop:
for(iii in 1:nrow(X)){
X_clean <- A[which(A$index_a == vec_a[iii]),-1] - # subtract correct A value
B[which(B$index_b == vec_b[iii]),-1] # subtract correct B value
X_clean <- as.numeric(X_clean)
X[iii,] = X[iii,] - X_clean
}
And check the values are equal:
all.equal(c(newX),c(X))
[1] TRUE
Match should be pretty fast, but if it is still too slow, you can just call out the values of A
using vec_a
, like A[vec_a,]
..
Upvotes: 2