snape

Reputation: 129

Effective way of calculating Correlation/Covariance Matrix row-wise

I have around 3000 files. Each file has around 55000 rows/identifiers and roughly 100 columns. I need to calculate a row-wise correlation or weighted covariance matrix for each file (depending on the number of columns in the file). The number of rows is the same in all the files. I would like to know the most effective way to calculate the correlation matrix for each file. I have tried Perl and C++, but both take a long time to process a file -- Perl takes 6 days, C takes more than a day. Typically, I don't want to spend more than 15-20 minutes per file.

Now, I would like to know whether I can process the files faster with some trick. Here is my pseudocode:

while (using the file handler)
  reading the file line by line
  Storing the column values in hash1 where the key is the identifier
  Storing the mean and ssxx (sum of squared deviations of x from the mean) in hash2 and hash3 respectively (I used a hash of hashes in Perl) by calling the mean and ssxx functions
end
close file handler

for loop traversing the hash (this is nested for loop as I need values of 2 different identifiers to calculate correlation coefficient)
  calculate ssxy by calling the ssxy function, i.e. the sum of products of the deviations of x and y from their means
  calculate correlation coefficient.
end

I calculate the correlation coefficient for each pair only once, and I never calculate the correlation of an identifier with itself; the nested for loop takes care of that. Is there a way to calculate the correlation coefficients faster? Any hints/advice would be great. Thanks!

EDIT1: My input file looks like this for the first 10 identifiers (only the first few columns shown):

"Ident_01"  6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02"  449.803 367.757 302.633 318.037 331.55
"Ident_03"  16.4878 198.937 220.376 91.352  237.983
"Ident_04"  26.4878 398.937 130.376 92.352  177.983
"Ident_05"  36.4878 298.937 430.376 93.352  167.983
"Ident_06"  46.4878 498.937 560.376 94.352  157.983
"Ident_07"  56.4878 598.937 700.376 95.352  147.983
"Ident_08"  66.4878 698.937 990.376 96.352  137.983
"Ident_09"  76.4878 798.937 120.376 97.352  117.983
"Ident_10"  86.4878 898.937 450.376 98.352  127.983

EDIT2: Here are the subroutines I wrote in Perl:

use List::Util qw(sum);    ## sum() is used in the mean subroutine below

## Pearson Correlation Coefficient
sub correlation {
    my( $arr1, $arr2) = @_;
    my $ssxy = ssxy( $arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean} );
    my $cor = $ssxy / sqrt( $arr1->{ssxx} * $arr2->{ssxx} );
    return $cor ;
}

## Mean
sub mean {
    my $arr1 = shift;
    my $mu_x = sum( @$arr1) /scalar(@$arr1);
    return($mu_x);
}

## Sum of squared deviations of x from the mean, i.e. ssxx
sub ssxx {
    my ( $arr1, $mean_x ) = @_;
    my $ssxx = 0;

    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxx = $ssxx + ( $arr1->[$i] - $mean_x )**2;
    }
    return($ssxx); 
}

## Sum of products of the deviations of x and y from their means, i.e. ssxy
sub ssxy {
    my( $arr1, $arr2, $mean_x, $mean_y ) = @_;
    my $ssxy = 0;

    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxy = $ssxy + ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
    }
    return ($ssxy);
}
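
For completeness, here is a sketch of the pair loop from the pseudocode above that drives these subroutines. The hash %data and its layout (identifier => { string => \@values, mean => ..., ssxx => ... }) are illustrative names, not taken from my actual code:

## Nested pair loop: each unordered pair of identifiers is visited once.
my @ids = sort keys %data;
for my $i (0 .. $#ids) {
    for my $j ($i + 1 .. $#ids) {
        my $r = correlation($data{$ids[$i]}, $data{$ids[$j]});
        ## ... store or print $r for the pair ($ids[$i], $ids[$j])
    }
}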

Upvotes: 0

Views: 1651

Answers (3)

Degustaf

Reputation: 2670

@Sinan and @Praveen have the right idea for how to do this within Perl, but the overhead inherent in Perl means you will probably never get the speed you are looking for. I would instead work on optimizing your C code.

The first step would be to compile with the -O3 flag for maximum optimization.

From there, I would change your ssxx code so that it subtracts the mean from each data point in place: x[i] -= mean. This means you no longer need to subtract the mean in your ssxy code, so the subtraction happens once instead of 55001 times.
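
The same precomputation can be sketched in the question's Perl terms (the C version would have the same structure; the subroutine names below are illustrative):

## Center each row once up front so that ssxy no longer subtracts the
## mean inside the pair loop.
sub center_in_place {
    my ($arr, $mean) = @_;
    $_ -= $mean for @$arr;    # x[i] -= mean, done once per row
}

## ssxy on already-centered rows is just a sum of products.
sub ssxy_centered {
    my ($cx, $cy) = @_;
    my $ssxy = 0;
    $ssxy += $cx->[$_] * $cy->[$_] for 0 .. $#$cx;
    return $ssxy;
}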

I would check the disassembly to verify that (x - mean)**2 is compiled down to a single multiplication rather than a generic power routine such as 2^(2 * log(x - mean)), or just write it as a multiplication in the first place.

What sort of data structure are you using for your data? A double** with memory allocated for each row leads to extra calls to the (slow) malloc function. Also, it is more likely to lead to memory thrashing, since the allocated rows can end up in different places. Ideally, you should make as few calls to malloc as possible, for blocks of memory that are as large as possible, and traverse the data with pointer arithmetic.

More optimizations should be possible. If you post your code, I can make some suggestions.

Upvotes: 0

Sinan Ünür

Reputation: 118138

While minor improvements might be possible, I would suggest investing in learning PDL. The documentation on matrix operations may be useful.
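
For illustration, the full row-by-row correlation matrix can be expressed in a few PDL operations. This is only a sketch: it assumes the file has already been read into a Perl array @rows holding one array reference of column values per identifier (an illustrative name, not from the question's code).

use PDL;

my $data = pdl(\@rows);                      # dims: (columns, identifiers)

## Center each row by subtracting its row mean.
my $means    = $data->average;               # one mean per identifier
my $centered = $data - $means->dummy(0);

## ssxy for every pair of rows via a single matrix multiplication.
my $ssxy = $centered x $centered->transpose;

## Each row's ssxx sits on the diagonal; normalize to get correlations.
my $ssxx = $ssxy->diagonal(0, 1);
my $corr = $ssxy / sqrt(outer($ssxx, $ssxx));

Keep in mind that a full 55000 x 55000 matrix of doubles is roughly 24 GB, so in practice you would probably compute the result in blocks of rows; the centering-plus-matrix-multiplication structure stays the same.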

Upvotes: 0

Praveen

Reputation: 902

Have you searched CPAN? The gsl_stats_correlation function computes Pearson's correlation coefficient. It is provided by Math::GSL::Statistics, which binds to the GNU Scientific Library.

gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n:

r = \frac{\operatorname{cov}(x, y)}{\hat\sigma_x \hat\sigma_y}
  = \frac{\frac{1}{n-1} \sum_i (x_i - \hat{x})(y_i - \hat{y})}{\sqrt{\frac{1}{n-1} \sum_i (x_i - \hat{x})^2} \; \sqrt{\frac{1}{n-1} \sum_i (y_i - \hat{y})^2}}
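
For a single pair of identifiers, the call could look like this (a sketch; %hash1 is the hash of column-value array references described in the question, and the identifier names come from its sample data):

use Math::GSL::Statistics qw(gsl_stats_correlation);

## $x and $y are array refs holding the ~100 column values of two identifiers.
my $x = $hash1{'Ident_01'};
my $y = $hash1{'Ident_02'};
my $r = gsl_stats_correlation($x, 1, $y, 1, scalar @$x);

The nested pair loop from the question can then call this directly in place of the hand-rolled ssxy and correlation subroutines.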

Upvotes: 0
