Reputation: 497
I tried to solve the following question with the data.table package: Is there a faster way to subset a sparse Matrix than '['?
But I get the this error:
Error in Z[, cols] : invalid or not-yet-implemented 'Matrix' subsetting
10 stop("invalid or not-yet-implemented 'Matrix' subsetting")
9 Z[, cols]
8 Z[, cols]
7 FUN(X[[i]], ...)
6 lapply(X = ans[index], FUN = FUN, ...)
5 tapply(.SD, INDEX = "gene_name", FUN = simple_fun, Z = Z, simplify = FALSE)
4 eval(expr, envir, enclos)
3 eval(jsub, SDenv, parent.frame())
2 `[.data.table`(lkupdt, , tapply(.SD, INDEX = "gene_name", FUN = simple_fun,
Z = Z, simplify = FALSE), .SDcols = c("snps"))
1 lkupdt[, tapply(.SD, INDEX = "gene_name", FUN = simple_fun, Z = Z,
simplify = FALSE), .SDcols = c("snps")]
Here is my solution:
library(data.table)
library(Matrix)
seed(1)
n_subjects <- 1e3
n_snps <- 1e5
sparcity <- 0.05
n <- floor(n_subjects*n_snps*sparcity)
# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(1:(n_subjects*n_snps), size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals
# create the data frame on how to split
# real data set the grouping size is between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}
# split our matrix based look up table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
lkupdt <- data.table(lkup)
lkupdt[, tapply(.SD, INDEX = 'gene_name' , FUN = simple_fun, Z = Z, simplify = FALSE), .SDcols = c('snps')]
The question is about the last line of code which tries to replicate the function above saved to "res". Am I doing something wrong with data.table or is this simply not possible? Thanks for your help!
Upvotes: 1
Views: 1067
Reputation: 6796
I think sum()
is too simple to estimate time and you would get a more suitable answer when you show a more real function
. (I approached without data.table()
)
For example, this function
looks equal or faster than a data.table()
approach (Of course, this approach can't be used with complex function
);
sum.func <- function(Z, lkup) {
Zsum <- colSums(Z)[lkup$snps]
Z2 <- cbind(Zsum, lkup$gene_name)
res <- c(tapply(Z2[,1], Z2[,2], sum))
names(res) <- levels(lkup$gene_name)
return(c(res))
}
system.time(
test.res <- sum.func(Z, lkup)
)
all.equal(unlist(res), test.res)
This is more general but clearly slower than data.table()
approach.
general.fun <- function(Z, lkup) {
Z2 <- Z[, lkup$snps]
num.gn <- as.numeric(lkup$gene_name)
res <- sapply(1:max(num.gn), function(x) sum(Z2[, which(num.gn == x)]))
names(res) <- levels(lkup$gene_name)
return(res)
}
system.time(
test.res2 <- general.fun(Z, lkup)
)
all.equal(unlist(res), test.res2)
Upvotes: 0
Reputation: 66819
No, I don't think you can speed up accessing a Matrix object using data.table. However, if you are willing to use a data.table instead of a Matrix...
ZDT = setDT(summary(Z))
system.time(
resDT <- ZDT[lkupdt, on = c(j = "snps")][, sum(x), by=gene_name]
)
# verify correctness
all.equal(
unname(unlist(res))[order(as.numeric(substring(names(res), 2, nchar(names(res)))))],
resDT$V1
)
It gives the result like
gene_name V1
1: g1 3.720619
2: g2 35.727923
3: g3 -3.949385
4: g4 -18.253456
5: g5 5.970879
---
496: g496 -20.979669
497: g497 63.880925
498: g498 16.498587
499: g499 -17.417110
500: g500 45.169608
Of course, you may need to keep the data in a sparse Matrix for other reasons, but this is a lot faster on my computer and has simpler input and output.
Upvotes: 1