Reputation: 787
I have a piece of code now that lives inside an optimization routine. Below at the bottom are sample objects to use to see how this code works.
When all elements of the matrix X
are observed, the calculation is very efficient and can be written as follows using res1
. The object created in res2
produces the same result as res1
but loops over rows and is very expensive and inefficient in R.
### If everything is observed
res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
all.equal(res1[,1], res2)
Now, the problem is in my real world scenario, there will often be missing values in the matrix X
. As such, the calculation for res1
would yield an NA
for its first element as shown in this new example (for obvious reasons, this is not my question). The object created by res2
gives exactly what I would need in this instance, but reverts to a loop and then becomes theoretically right in terms of what I want, but computationally not desirable.
### This would not work, as expected.
res1 <- exp(X %*% log(pr.t) + mX %*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
My question is whether anyone is aware of a way to produce the same result as res2
when there is missing data in X
as I do with the sapply()
method but is equally as efficient as the big matrix calculation?
I see two options, both of which I am exploring. One option could be to use parallel processing for the loop and a second option could be to use Rcpp. Both decent options. However, before going down either of those two pathways, I'm asking for some help to learn if anyone sees a really nice computational implementation that I am not seeing?
### Objects to run sample code
X <- structure(c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), dim = c(5L,
5L), dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4",
"Item 5")))
pr.t <- structure(c(0.000389840525419771, 0.000389840525419771, 0.000389840525419771,
0.000389840525419771, 0.000389840525419771, 0.00116782384335194,
0.00116782384335194, 0.00116782384335194, 0.00116782384335194,
0.00116782384335194, 0.00293127561410344, 0.00293127561410344,
0.00293127561410344, 0.00293127561410344, 0.00293127561410344,
0.00672641421586068, 0.00672641421586068, 0.00672641421586068,
0.00672641421586068, 0.00672641421586068, 0.0145666908055583,
0.0145666908055583, 0.0145666908055583, 0.0145666908055583, 0.0145666908055583,
0.0301824687604691, 0.0301824687604691, 0.0301824687604691, 0.0301824687604691,
0.0301824687604691, 0.0600531695657659, 0.0600531695657659, 0.0600531695657659,
0.0600531695657659, 0.0600531695657659, 0.114143103288218, 0.114143103288218,
0.114143103288218, 0.114143103288218, 0.114143103288218, 0.204278364784018,
0.204278364784018, 0.204278364784018, 0.204278364784018, 0.204278364784018,
0.336697623276164, 0.336697623276164, 0.336697623276164, 0.336697623276164,
0.336697623276164, 0.5, 0.5, 0.5, 0.5, 0.5, 0.663302376723836,
0.663302376723836, 0.663302376723836, 0.663302376723836, 0.663302376723836,
0.795721635215982, 0.795721635215982, 0.795721635215982, 0.795721635215982,
0.795721635215982, 0.885856896711782, 0.885856896711782, 0.885856896711782,
0.885856896711782, 0.885856896711782, 0.939946830434234, 0.939946830434234,
0.939946830434234, 0.939946830434234, 0.939946830434234, 0.969817531239531,
0.969817531239531, 0.969817531239531, 0.969817531239531, 0.969817531239531,
0.985433309194442, 0.985433309194442, 0.985433309194442, 0.985433309194442,
0.985433309194442, 0.993273585784139, 0.993273585784139, 0.993273585784139,
0.993273585784139, 0.993273585784139, 0.997068724385897, 0.997068724385897,
0.997068724385897, 0.997068724385897, 0.997068724385897, 0.998832176156648,
0.998832176156648, 0.998832176156648, 0.998832176156648, 0.998832176156648,
0.99961015947458, 0.99961015947458, 0.99961015947458, 0.99961015947458,
0.99961015947458), dim = c(5L, 21L))
wts <- c(2.09899121956567e-14, 4.97536860412164e-11, 1.45066128449311e-08,
1.22535483614825e-06, 4.21923474255167e-05, 0.000708047795481538,
0.00643969705140876, 0.033952729786543, 0.108392285626419, 0.21533371569506,
0.270260183572876, 0.21533371569506, 0.10839228562642, 0.0339527297865429,
0.00643969705140878, 0.000708047795481537, 4.21923474255168e-05,
1.22535483614826e-06, 1.45066128449309e-08, 4.97536860412161e-11,
2.09899121956567e-14)
mX <- 1 - X
Upvotes: 0
Views: 58
Reputation: 3902
Note that you use colSums(.,na.rm=T)
in res2
, which in this case is equivalent to setting missing value to 0. Therefore, we can do the same to res1
:
library(tidyr)
res1 <- exp(replace_na(X ,0)%*% log(pr.t) + replace_na(mX ,0)%*% log(1 - pr.t)) %*% wts
res2 <- sapply(1:nrow(X), function(i) exp(colSums(X[i,1:5] * log(pr.t[1:5,]), na.rm = TRUE) + colSums(mX[i,1:5] * log(1 - pr.t[1:5,]), na.rm=TRUE))%*% wts)
#Using the following X with 1 missing value
X <- structure(c(0L, 0L, NA_real_, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), dim = c(5L,
5L), dimnames = list(NULL, c("Item 1", "Item 2", "Item 3", "Item 4",
"Item 5")))
> all.equal(res1[,1], res2)
[1] TRUE
Upvotes: 3