I am trying to apply surrogate variable analysis using Bioconductor's sva package. The example in the vignette works fine, but when I try it with real data, I get a "subscript out of bounds" error in irwsva.build:
$ R
R version 2.15.0 (2012-03-30)
…
> trainData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData.txt')
> trainpheno <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
> testData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/testData.txt')
> trainData <- as.matrix(trainData)
> testData <- as.matrix(testData)
> library(sva)
> trainMod <- model.matrix(~as.factor(label), trainpheno)
> num.sv(trainData, trainMod)
[1] 8
> trainMod0 <- model.matrix(~1, trainpheno)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 8
Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat = dat, mod = mod, mod0 = mod0, n.sv = n.sv, :
subscript out of bounds
An attempt to narrow it down with debug() revealed that fast.svd is being called on a 453 x 100 matrix of all zeros. (The dimensions 453 x 100 are the same as my training set.) This results in a V that is 100 x 0; the "subscript out of bounds" error arises because irwsva.build attempts to index into V. There must be something about my data that causes this behaviour, but what?
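The indexing failure is easy to reproduce in isolation. This is a minimal sketch using the shapes quoted above (100 x 0 for V, n.sv = 8):

```r
# A matrix with 100 rows and zero columns, the shape of the V that
# fast.svd returns for an all-zero input
V <- matrix(numeric(0), nrow = 100, ncol = 0)
n.sv <- 8

# Selecting columns that do not exist raises the same error sva reports
msg <- tryCatch(V[, 1:n.sv], error = function(e) conditionMessage(e))
msg
# [1] "subscript out of bounds"
```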
As a possible workaround, I tried calling sva with method="two-step":
> trainSv <- sva(trainData, trainMod, trainMod0, method='two-step')
Number of significant surrogate variables is: 8
That worked, but I need to subsequently call fsva. That failed because calling sva with method="two-step" resulted in trainSv$pprob.b being NULL.
So how do my data differ from those in the vignette? The training and test data are matrices in both cases. In the vignette the training matrix is 22283 x 30; in my case, it is 453 x 100. In the vignette, the variable of interest (cancer) is binary; in my case, the dependent variable can take 12 different values.
The last difference seems to be important, for if I reduce the range to [0, 7], it works:
> trainMod <- model.matrix(~as.factor(label), trainpheno %% 8)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 9
Iteration (out of 5 ):1 2 3 4 5 >
Thinking that perhaps 100 samples (columns) is just insufficient for 12 classes, I tried a similar dataset with 293 columns. (The data are from the same experiment, but profile the 293 individual samples rather than the 100 treatments.) It did not help:
> trainData <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData3.txt')
> trainpheno <- read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
> trainData <- as.matrix(trainData)
> trainMod <- model.matrix(~as.factor(label), trainpheno)
> trainMod0 <- model.matrix(~1, trainpheno)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat = dat, mod = mod, mod0 = mod0, n.sv = n.sv, :
subscript out of bounds
If I limit sva to one iteration, it is able to run to completion, but I don't know if I can trust the results:
> trainSv <- sva(trainData, trainMod, trainMod0, B=1)
Number of significant surrogate variables is: 11
Iteration (out of 1 ):1 >
Does anyone understand irwsva well enough to say why this is happening? Is there anything I can do to make it work on my data?
Response from John Leek (the author of sva) on the Bioconductor mailing list:
This problem is likely because of the small number of genes/features you are considering (453) and the high dimension of the response variable (12). With so many different levels of the response variable, many features are likely significantly associated with the response. Part of the iteration in the sva algorithm is to downweight features strongly associated with the response, so the whole data set is being down-weighted to 0.
I would suggest running only one iteration of sva. Usually it takes a very small number of iterations to converge, and since your data are relatively low dimensional in the number of features, this may be the best that you can do if you are doing artifact discovery.
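Leek's point can be illustrated with a toy calculation. This is not the package's actual weighting code, just a simplified stand-in: when every feature is strongly associated with the response, the per-feature weights go to roughly zero, and the weighted matrix that gets decomposed is all but zero.

```r
set.seed(1)
# Toy data: 10 features, 20 samples, every feature driven by the response
group <- rep(0:1, each = 10)
dat <- matrix(rep(group, 10), nrow = 10, byrow = TRUE) * 5 +
  matrix(rnorm(200, sd = 0.1), nrow = 10)

# Simplified stand-in for the iterative reweighting: weight each feature
# by (1 - probability it is associated with the response).  With every
# feature strongly associated, those probabilities are essentially 1 ...
pvals <- apply(dat, 1, function(x) t.test(x ~ group)$p.value)
prob.assoc <- 1 - pvals

# ... so the weighted matrix is numerically zero, and its SVD degenerates
dats <- dat * (1 - prob.assoc)
max(abs(dats))   # vanishingly small
```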
The proximal reason for the failure is that irwsva.build uses fast singular value decomposition, which returns only the positive singular values of the matrix, as noted in ?fast.svd. In your data, all the singular values are zero, which is not positive, so you must use plain svd instead of fast.svd.
I have created a patched function, sva.patched, which modifies the irwsva.build and sva functions slightly to handle this edge case. I essentially change one line in irwsva.build:
# Before
sv = fast.svd(dats, tol = 0)$v[, 1:n.sv]

# After
if (any(dats != 0)) sv = fast.svd(dats, tol = 0)$v[, 1:n.sv]
else sv = svd(dats)$v[, 1:n.sv]
You can pick up the code here:
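The difference the patch exploits can be seen without the corpcor package by emulating what fast.svd does, namely keeping only the columns of V that belong to positive singular values:

```r
# An all-zero matrix, standing in for the weighted 453 x 100 data
dats <- matrix(0, nrow = 5, ncol = 3)

s <- svd(dats)
positive <- s$d > 0                       # no positive singular values here
v.fast <- s$v[, positive, drop = FALSE]   # what fast.svd would return

ncol(v.fast)  # 0 -> indexing v.fast[, 1:n.sv] is out of bounds
ncol(s$v)     # 3 -> plain svd still gives usable columns
```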
But the real question is, why did these data end up producing a zero-value matrix? I don't know much about this method, but I can give you some clues.
From what I can tell, you used the functions correctly. However, if you examine the loop in the irwsva.build function, you will find that it returns a zero matrix if the edge.lfdr function ever returns 0. That function only returns zero when none of the p-values returned by f.pvalue are above 0.8.
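Why the absence of p-values above 0.8 collapses everything: the lfdr machinery needs an estimate of the proportion of null features, and Storey-style estimators (sketched here; this is not necessarily edge.lfdr's exact code) derive it from the fraction of p-values beyond a cutoff lambda:

```r
# Storey-style estimate of the proportion of null features:
# the fraction of p-values above lambda, rescaled by 1 - lambda
pi0.estimate <- function(p, lambda = 0.8) {
  mean(p > lambda) / (1 - lambda)
}

# p-values like the asker's: everything strongly associated with the model
p.all.small <- runif(453, min = 0, max = 0.05)
pi0.estimate(p.all.small)   # 0 -> downstream weights become all zero

# p-values with a healthy null component, like the vignette data
set.seed(2)
p.mixed <- c(runif(300), runif(153, max = 0.01))
pi0.estimate(p.mixed)       # positive
```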
Breaking down irwsva.build, this is how it starts with your data:
dat <- trainData
mod <- trainMod
mod0 <- trainMod0
n.sv <- num.sv(dat, mod)   # 8 for these data
Id <- diag(ncol(dat))
resid <- dat %*% (Id - mod %*% solve(t(mod) %*% mod) %*% t(mod))
uu <- eigen(t(resid) %*% resid)
# Iterations begin.
mod.b <- cbind(mod, uu$vectors[, 1:n.sv])
mod0.b <- cbind(mod0, uu$vectors[, 1:n.sv])
ptmp <- f.pvalue(dat, mod.b, mod0.b)
which(ptmp > 0.8)
# Only one value
Now, the first time you go through the loop, there is only one p-value that is above 0.8. By the second iteration, there are none, which is the cause of all the zeroes.
If you run the same code on the vignette data, you will find that it has many p-values above 0.8, which is why it doesn't return an error.