morpheus
morpheus

Reputation: 20310

R: How to compute correlation between rows of a matrix without having to transpose it?

I have a big matrix and am interested in computing the correlation between the rows of the matrix. Since the cor method computes correlation between the columns of a matrix, I am transposing the matrix before calling cor. But since the matrix is big, transposing it is expensive and is slowing down my program. Is there a way to compute the correlations among the rows without having to take transpose?

EDIT: thanks for the responses. thought i'd share some findings. my input matrix is 16 rows by 239766 cols and comes from a .mat file. I wrote C# code to do the same thing using the csmatio library. it looks like this:

foreach (var file in Directory.GetFiles(path, interictal_pattern))
            {
                var reader = new MatFileReader(file);
                var mla = reader.Data[0] as MLStructure;
                convert(mla.AllFields[0] as MLNumericArray<double>, data);

                double sum = 0;
                for (var i = 0; i < 16; i++)
                {
                    for (var j = i + 1; j < 16; j++)
                    {
                        sum += cor(data, i, j);
                    }
                }
                var avg = sum / 120;                
                if (++count == 10)
                {
                    var t2 = DateTime.Now;
                    var t = t2 - t1;
                    Console.WriteLine(t.TotalSeconds);
                    break;
                }
            }



        static double[][] createArray(int rows, int cols)
        {
            var ans = new double[rows][];
            for (var row = 0; row < rows; row++)
            {
                ans[row] = new double[cols];
            }
            return ans;
        }

        static void convert(MLNumericArray<double> mla, double[][] M)
        {
            var rows = M.Length;
            var cols = M[0].Length;
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    M[i][j] = mla.Get(i, j);
        }

        static double cor(double[][] M, int i, int j)
        {
            var count = M[0].Length;
            double sum1 = 0, sum2 = 0;
            for (int ctr = 0; ctr < count; ctr++)
            {
                sum1 += M[i][ctr];
                sum2 += M[j][ctr];
            }
            var mu1 = sum1 / count;
            var mu2 = sum2 / count;
            double numerator = 0, sumOfSquares1 = 0, sumOfSquares2 = 0;
            for (int ctr = 0; ctr < count; ctr++)
            {
                var x = M[i][ctr] - mu1;
                var y = M[j][ctr] - mu2;
                numerator += x * y;
                sumOfSquares1 += x * x;
                sumOfSquares2 += y * y;
            }
            return numerator / Math.Sqrt(sumOfSquares1 * sumOfSquares2);
        }       

this gave a throughput of 22.22s for 10 files or 2.22s/file

Then I profiled my R code:

ptm=proc.time()
for(file in files)
{   
    i = i + 1;
    mat = readMat(paste(path,file,sep=""))  
    a = t(mat[[1]][[1]])

    C = cor(a)
    correlations[i] = mean(C[lower.tri(C)])

}
print(proc.time()-ptm)

to my surprise its running faster than C# and is giving throughput of 5.7s per 10 files or 0.6s/file (an improvement of almost 4x!). The bottleneck in C# is the methods inside csmatio library to parse double values from input stream.

enter image description here

and if i do not convert the csmatio classes into a double[][] then the C# code runs extremely slow (order of magnitude slower ~20-30s/file).

Upvotes: 0

Views: 4914

Answers (3)

IRTFM
IRTFM

Reputation: 263301

Seeing that this problem arises from a data input issue whose details are not stated (and only hinted at in a comment), I will assume this is a comma-delimited file of unquoted numbers with the number of columns= Ncol. This does the transposition on input.

  in.mat <- matrix( scan("path/to/the_file/fil.txt", what =numeric(0), sep=","), 
                    ncol=Ncol, byrow=TRUE)

  cor(in.nmat)

Upvotes: 4

Sven Hohenstein
Sven Hohenstein

Reputation: 81683

You can use outer:

outer(seq(nrow(mat)), seq(nrow(mat)), 
      Vectorize(function(x, y) cor(mat[x , ], mat[y , ])))

where mat is the name of your matrix.

Upvotes: 1

Teemu Daniel Laajala
Teemu Daniel Laajala

Reputation: 2366

One dirty work-around would be to apply cor-functions row-wise and produce the correlation matrix from the results. You could try if this is any more efficient (which I doubt, though you could fine-tune it by not double computing everything or the redundant diagonal cases):

# Apply 2-fold nested row-wise functions
set.seed(1)
dat <- matrix(rnorm(1000), nrow=10)
cormat <- apply(dat, MARGIN=1, FUN=function(z) apply(dat, MARGIN=1, FUN=function(y) cor(z, y)))
cormat[1:3,1:3] # Show few first
#            [,1]         [,2]       [,3]
#[1,] 1.000000000  0.002175792  0.1559263
#[2,] 0.002175792  1.000000000 -0.1870054
#[3,] 0.155926259 -0.187005418  1.0000000

Though, generally I would expect the transpose to have a really, really efficient implementation, so it's hard to imagine when that would be the bottle-neck. But, you could also dig through the implementation of 'cor' function and call the correlation C-function itself by first making sure your rows are suitable. Type 'cor' in the terminal to see the implementation, which is mostly a wrapper that makes input suitable for the C-function:

# Row with C-call from the implementation of 'cor':
#    if (method == "pearson") 
#        .Call(C_cor, x, y, na.method, FALSE)

Upvotes: 2

Related Questions