I am trying to apply surrogate variable analysis using Bioconductor's sva package. The example in the vignette works fine, but when I try it with real data, I get a "subscript out of bounds" error in irwsva.build:
$ R
R version 2.15.0 (2012-03-30)
…
> trainData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData.txt')
> trainpheno <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
> testData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/testData.txt')
> trainData <- as.matrix(trainData)
> testData <- as.matrix(testData)
> library(sva)
> trainMod <- model.matrix(~as.factor(label), trainpheno)
> num.sv(trainData, trainMod)
[1] 8
> trainMod0 <- model.matrix(~1, trainpheno)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 8
Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat = dat, mod = mod, mod0 = mod0, n.sv = n.sv, :
subscript out of bounds
An attempt to narrow it down with debug() revealed that fast.svd is being called on a 453 x 100 matrix of all zeros. (The dimensions 453 x 100 are the same as my training set.) This results in a V that is 100 x 0; the "subscript out of bounds" error arises because irwsva.build attempts to index into V. There must be something about my data that causes this behaviour, but what?
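The indexing failure is easy to reproduce in isolation. This is a minimal sketch using the shapes quoted above (100 x 0 for V, n.sv = 8):

```r
# A matrix with 100 rows and zero columns, the shape of the V that
# fast.svd returns for an all-zero input
V <- matrix(numeric(0), nrow = 100, ncol = 0)
n.sv <- 8

# Selecting columns that do not exist raises the same error sva reports
msg <- tryCatch(V[, 1:n.sv], error = function(e) conditionMessage(e))
msg
# [1] "subscript out of bounds"
```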
As a possible workaround, I tried calling sva with method="two-step":
> trainSv <- sva(trainData, trainMod, trainMod0, method='two-step')
Number of significant surrogate variables is: 8
That worked, but I need to subsequently call fsva. That failed because calling sva with method="two-step" resulted in trainSv$pprob.b being NULL.
So how do my data differ from those in the vignette? The training and test data are matrices in both cases. In the vignette the training matrix is 22283 x 30; in my case, it is 453 x 100. In the vignette, the variable of interest (cancer) is binary; in my case, the dependent variable can take 12 different values.
The last difference seems to be important, for if I reduce the range to [0, 7], it works:
> trainMod <- model.matrix(~as.factor(label), trainpheno %% 8)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 9
Iteration (out of 5 ):1 2 3 4 5 >
Thinking that perhaps 100 samples (columns) is just insufficient for 12 classes, I tried a similar dataset with 293 columns. (The data are from the same experiment, but profile the 293 individual samples rather than the 100 treatments.) It did not help:
> trainData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData3.txt')
> trainpheno <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
> trainData <- as.matrix(trainData)
> trainMod <- model.matrix(~as.factor(label), trainpheno)
> trainMod0 <- model.matrix(~1, trainpheno)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat = dat, mod = mod, mod0 = mod0, n.sv = n.sv, :
subscript out of bounds
If I limit sva to one iteration, it is able to run to completion, but I don't know if I can trust the results:
> trainSv <- sva(trainData, trainMod, trainMod0, B=1)
Number of significant surrogate variables is: 11
Iteration (out of 1 ):1 >
Does anyone understand irwsva well enough to say why this is happening? Is there anything I can do to make it work on my data?
Response from John Leek (the author of sva) on the Bioconductor mailing list:
This problem is likely because of the small number of genes/features you are considering (453) and the high dimension of the response variable (12). With so many different levels of the response variable, many features are likely significantly associated with the response. Part of the iteration in the sva algorithm is to downweight features strongly associated with the response, so the whole data set is being down-weighted to 0.
I would suggest running only one iteration of sva. Usually it takes a very small number of iterations to converge, and since your data are relatively low dimensional in the number of features, this may be the best that you can do if you are doing artifact discovery.
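Leek's point can be illustrated with a toy calculation. This is not the package's actual weighting code, just a simplified stand-in: when every feature is strongly associated with the response, the per-feature weights go to roughly zero, and the weighted matrix that gets decomposed is all but zero.

```r
set.seed(1)
# Toy data: 10 features, 20 samples, every feature driven by the response
group <- rep(0:1, each = 10)
dat <- matrix(rep(group, 10), nrow = 10, byrow = TRUE) * 5 +
  matrix(rnorm(200, sd = 0.1), nrow = 10)

# Simplified stand-in for the iterative reweighting: weight each feature
# by (1 - probability it is associated with the response).  With every
# feature strongly associated, those probabilities are essentially 1 ...
pvals <- apply(dat, 1, function(x) t.test(x ~ group)$p.value)
prob.assoc <- 1 - pvals

# ... so the weighted matrix is numerically zero, and its SVD degenerates
dats <- dat * (1 - prob.assoc)
max(abs(dats))   # vanishingly small
```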
The proximal reason for the failure is that irwsva.build uses fast singular value decomposition, which returns only the positive singular values of the matrix, as noted in ?fast.svd. In your data, all the singular values are zero, which is not positive, so you must use plain svd instead of fast.svd.
I have created a patched function, sva.patched, which modifies the irwsva.build and sva functions slightly to handle this edge case. I essentially change one line in irwsva.build:
# Before
sv = fast.svd(dats, tol = 0)$v[, 1:n.sv]

# After
if (any(dats != 0)) sv = fast.svd(dats, tol = 0)$v[, 1:n.sv]
else sv = svd(dats)$v[, 1:n.sv]
You can pick up the code here:
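The difference the patch exploits can be seen without the corpcor package by emulating what fast.svd does, namely keeping only the columns of V that belong to positive singular values:

```r
# An all-zero matrix, standing in for the weighted 453 x 100 data
dats <- matrix(0, nrow = 5, ncol = 3)

s <- svd(dats)
positive <- s$d > 0                       # no positive singular values here
v.fast <- s$v[, positive, drop = FALSE]   # what fast.svd would return

ncol(v.fast)  # 0 -> indexing v.fast[, 1:n.sv] is out of bounds
ncol(s$v)     # 3 -> plain svd still gives usable columns
```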
But the real question is, why did these data end up producing a zero-value matrix? I don't know much about this method, but I can give you some clues.
From what I can tell, you used the functions correctly. However, if you examine the loop in the irwsva.build function, you will find that it returns a zero matrix if the edge.lfdr function ever returns 0. That function only returns zero when none of the p-values returned by f.pvalue are above 0.8.
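Why the absence of p-values above 0.8 collapses everything: the lfdr machinery needs an estimate of the proportion of null features, and Storey-style estimators (sketched here; this is not necessarily edge.lfdr's exact code) derive it from the fraction of p-values beyond a cutoff lambda:

```r
# Storey-style estimate of the proportion of null features:
# the fraction of p-values above lambda, rescaled by 1 - lambda
pi0.estimate <- function(p, lambda = 0.8) {
  mean(p > lambda) / (1 - lambda)
}

# p-values like the asker's: everything strongly associated with the model
p.all.small <- runif(453, min = 0, max = 0.05)
pi0.estimate(p.all.small)   # 0 -> downstream weights become all zero

# p-values with a healthy null component, like the vignette data
set.seed(2)
p.mixed <- c(runif(300), runif(153, max = 0.01))
pi0.estimate(p.mixed)       # positive
```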
Breaking down irwsva.build, this is how it starts with your data:
dat <- trainData
mod <- trainMod
mod0 <- trainMod0
n.sv <- num.sv(dat, mod)   # 8 for these data
Id <- diag(ncol(dat))
resid <- dat %*% (Id - mod %*% solve(t(mod) %*% mod) %*% t(mod))
uu <- eigen(t(resid) %*% resid)
# Iterations begin.
mod.b <- cbind(mod, uu$vectors[, 1:n.sv])
mod0.b <- cbind(mod0, uu$vectors[, 1:n.sv])
ptmp <- f.pvalue(dat, mod.b, mod0.b)
which(ptmp > 0.8)
# Only one value
Now, the first time you go through the loop, there is only one p-value that is above 0.8. By the second iteration, there are none, which is the cause of all the zeroes.
If you run the same code on the vignette data, you will find that it has many p-values above 0.8, which is why it doesn't return an error.