Reputation: 20310
I have a big matrix and am interested in computing the correlation between the rows of the matrix. Since the cor
method computes correlation between the columns of a matrix, I am transposing the matrix before calling cor
. But since the matrix is big, transposing it is expensive and is slowing down my program. Is there a way to compute the correlations among the rows without having to take transpose?
EDIT: thanks for the responses. thought i'd share some findings. my input matrix is 16 rows by 239766 cols and comes from a .mat file. I wrote C# code to do the same thing using the csmatio
library. it looks like this:
foreach (var file in Directory.GetFiles(path, interictal_pattern))
{
var reader = new MatFileReader(file);
var mla = reader.Data[0] as MLStructure;
convert(mla.AllFields[0] as MLNumericArray<double>, data);
double sum = 0;
for (var i = 0; i < 16; i++)
{
for (var j = i + 1; j < 16; j++)
{
sum += cor(data, i, j);
}
}
var avg = sum / 120;
if (++count == 10)
{
var t2 = DateTime.Now;
var t = t2 - t1;
Console.WriteLine(t.TotalSeconds);
break;
}
}
static double[][] createArray(int rows, int cols)
{
var ans = new double[rows][];
for (var row = 0; row < rows; row++)
{
ans[row] = new double[cols];
}
return ans;
}
static void convert(MLNumericArray<double> mla, double[][] M)
{
var rows = M.Length;
var cols = M[0].Length;
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
M[i][j] = mla.Get(i, j);
}
static double cor(double[][] M, int i, int j)
{
var count = M[0].Length;
double sum1 = 0, sum2 = 0;
for (int ctr = 0; ctr < count; ctr++)
{
sum1 += M[i][ctr];
sum2 += M[j][ctr];
}
var mu1 = sum1 / count;
var mu2 = sum2 / count;
double numerator = 0, sumOfSquares1 = 0, sumOfSquares2 = 0;
for (int ctr = 0; ctr < count; ctr++)
{
var x = M[i][ctr] - mu1;
var y = M[j][ctr] - mu2;
numerator += x * y;
sumOfSquares1 += x * x;
sumOfSquares2 += y * y;
}
return numerator / Math.Sqrt(sumOfSquares1 * sumOfSquares2);
}
this gave a throughput of 22.22s for 10 files or 2.22s/file
Then I profiled my R code:
ptm=proc.time()
for(file in files)
{
i = i + 1;
mat = readMat(paste(path,file,sep=""))
a = t(mat[[1]][[1]])
C = cor(a)
correlations[i] = mean(C[lower.tri(C)])
}
print(proc.time()-ptm)
to my surprise its running faster than C# and is giving throughput of 5.7s per 10 files or 0.6s/file (an improvement of almost 4x!). The bottleneck in C# is the methods inside csmatio library to parse double values from input stream.
and if i do not convert
the csmatio classes into a double[][]
then the C# code runs extremely slow (order of magnitude slower ~20-30s/file).
Upvotes: 0
Views: 4914
Reputation: 263301
Seeing that this problem arises from a data input issue whose details are not stated (and only hinted at in a comment), I will assume this is a comma-delimited file of unquoted numbers with the number of columns= Ncol. This does the transposition on input.
in.mat <- matrix( scan("path/to/the_file/fil.txt", what =numeric(0), sep=","),
ncol=Ncol, byrow=TRUE)
cor(in.nmat)
Upvotes: 4
Reputation: 81683
You can use outer
:
outer(seq(nrow(mat)), seq(nrow(mat)),
Vectorize(function(x, y) cor(mat[x , ], mat[y , ])))
where mat
is the name of your matrix.
Upvotes: 1
Reputation: 2366
One dirty work-around would be to apply cor-functions row-wise and produce the correlation matrix from the results. You could try if this is any more efficient (which I doubt, though you could fine-tune it by not double computing everything or the redundant diagonal cases):
# Apply 2-fold nested row-wise functions
set.seed(1)
dat <- matrix(rnorm(1000), nrow=10)
cormat <- apply(dat, MARGIN=1, FUN=function(z) apply(dat, MARGIN=1, FUN=function(y) cor(z, y)))
cormat[1:3,1:3] # Show few first
# [,1] [,2] [,3]
#[1,] 1.000000000 0.002175792 0.1559263
#[2,] 0.002175792 1.000000000 -0.1870054
#[3,] 0.155926259 -0.187005418 1.0000000
Though, generally I would expect the transpose to have a really, really efficient implementation, so it's hard to imagine when that would be the bottle-neck. But, you could also dig through the implementation of 'cor' function and call the correlation C-function itself by first making sure your rows are suitable. Type 'cor' in the terminal to see the implementation, which is mostly a wrapper that makes input suitable for the C-function:
# Row with C-call from the implementation of 'cor':
# if (method == "pearson")
# .Call(C_cor, x, y, na.method, FALSE)
Upvotes: 2