Reputation: 129
I have around 3000 files. Each file has around 55000 rows (identifiers) and around 100 columns. I need to calculate a row-wise correlation or weighted covariance matrix for each file (depending on the number of columns in the file). The number of rows is the same in all the files. I would like to know the most effective way to calculate the correlation matrix for each file. I have tried Perl and C++, but both take a lot of time to process a file: Perl takes 6 days, C takes more than a day. Ideally, I don't want to take more than 15-20 minutes per file.
Now, I would like to know whether I could process the files faster using some trick. Here is my pseudocode:
while (using the file handler)
reading the file line by line
Storing the column values in hash1 where the key is the identifier
Storing the mean and ssxx (Sum of Squared Deviations of x to the mean) in hash2 and hash3 respectively (I used a hash of hashes in Perl) by calling the mean and ssxx functions
end
close file handler
for loop traversing the hash (this is nested for loop as I need values of 2 different identifiers to calculate correlation coefficient)
calculate ssxy by calling the ssxy function, i.e. the sum of products of the deviations of x and y from their means
calculate correlation coefficient.
end
Now, I calculate the correlation coefficient for each pair only once, and I never calculate the correlation coefficient of an identifier with itself; my nested for loop takes care of that. Do you think there is a way to calculate the correlation coefficient faster? Any hints/advice would be great. Thanks!
EDIT1: My Input File looks like this -- for the first 10 identifiers:
"Ident_01" 6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02" 449.803 367.757 302.633 318.037 331.55
"Ident_03" 16.4878 198.937 220.376 91.352 237.983
"Ident_04" 26.4878 398.937 130.376 92.352 177.983
"Ident_05" 36.4878 298.937 430.376 93.352 167.983
"Ident_06" 46.4878 498.937 560.376 94.352 157.983
"Ident_07" 56.4878 598.937 700.376 95.352 147.983
"Ident_08" 66.4878 698.937 990.376 96.352 137.983
"Ident_09" 76.4878 798.937 120.376 97.352 117.983
"Ident_10" 86.4878 898.937 450.376 98.352 127.983
EDIT2: here are the subroutines that I wrote in Perl:
use List::Util qw(sum);    # provides sum() used in mean()

## Pearson Correlation Coefficient
## $arr1 and $arr2 are hash refs holding the values ('string'), 'mean' and 'ssxx' of one identifier each
sub correlation {
    my ( $arr1, $arr2 ) = @_;
    my $ssxy = ssxy( $arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean} );
    my $cor  = $ssxy / sqrt( $arr1->{ssxx} * $arr2->{ssxx} );
    return $cor;
}

## Mean of an array ref of values
sub mean {
    my $arr1 = shift;
    my $mu_x = sum(@$arr1) / scalar(@$arr1);
    return $mu_x;
}

## Sum of Squared Deviations of x to the mean i.e. ssxx
sub ssxx {
    my ( $arr1, $mean_x ) = @_;
    my $ssxx = 0;
    ## looping over all the samples
    for my $i ( 0 .. $#$arr1 ) {
        $ssxx += ( $arr1->[$i] - $mean_x )**2;
    }
    return $ssxx;
}

## Sum of products of the deviations of x and y from their means i.e. ssxy
sub ssxy {
    my ( $arr1, $arr2, $mean_x, $mean_y ) = @_;
    my $ssxy = 0;
    ## looping over all the samples
    for my $i ( 0 .. $#$arr1 ) {
        $ssxy += ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
    }
    return $ssxy;
}
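For reference, here is a minimal driver sketch of how these subroutines fit together, following the pseudocode above (the $file name, the %data hash and the whitespace split are illustrative assumptions, not part of my actual script):

## Read the file once, building one record per identifier
my %data;
open my $fh, '<', $file or die "cannot open $file: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $id, @values ) = split ' ', $line;
    $id =~ s/"//g;                      # strip the quotes around the identifier
    my $mu = mean( \@values );
    $data{$id} = {
        string => \@values,
        mean   => $mu,
        ssxx   => ssxx( \@values, $mu ),
    };
}
close $fh;

## Correlate every unordered pair of identifiers exactly once
my @ids = sort keys %data;
for my $i ( 0 .. $#ids ) {
    for my $j ( $i + 1 .. $#ids ) {
        my $r = correlation( $data{ $ids[$i] }, $data{ $ids[$j] } );
        # ... store or print $r for the pair ($ids[$i], $ids[$j]) ...
    }
}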
Upvotes: 0
Views: 1651
Reputation: 2670
@Sinan and @Praveen have the right idea for how to do this within Perl. However, the overhead inherent in Perl means you will never get the efficiency you are looking for; I would suggest working on optimizing your C code instead.
The first step would be to set the -O3 flag for maximum code optimization.
From there, I would change your ssxx code so that it subtracts the mean from each data point in place: x[i] -= mean. This means that you no longer need to subtract the mean in your ssxy code, so the subtraction is done once instead of 55001 times.
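For illustration, the same centering idea expressed against the Perl subroutines from the question might look like the sketch below (the helper names center_in_place, ssxx_centered and ssxy_centered are mine; the C version is the same transformation):

## Subtract the mean from every value once, up front
sub center_in_place {
    my ( $arr, $mean ) = @_;
    $_ -= $mean for @$arr;
}

## With centered data, ssxx and ssxy become plain dot products
sub ssxx_centered {
    my ($arr) = @_;
    my $ss = 0;
    $ss += $_ * $_ for @$arr;
    return $ss;
}

sub ssxy_centered {
    my ( $arr1, $arr2 ) = @_;
    my $ss = 0;
    $ss += $arr1->[$_] * $arr2->[$_] for 0 .. $#$arr1;
    return $ss;
}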
I would also check the disassembly to make sure that (x - mean)**2 compiles down to a multiplication rather than something like 2^(2 * log(x - mean)), or just write it as an explicit multiplication in the first place.
What sort of data structure are you using for your data? A double** with memory allocated separately for each row will lead to extra calls to the (slow) malloc function, and it is more likely to cause memory thrashing because the allocated rows end up in different places. Ideally, you should make as few calls to malloc as possible, for blocks of memory that are as large as possible, and use pointer arithmetic to traverse the data.
More optimizations should be possible. If you post your code, I can make some suggestions.
Upvotes: 0
Reputation: 118138
While minor improvements might be possible, I would suggest investing in learning PDL. The documentation on matrix operations may be useful.
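For example, a rough PDL sketch of the whole-matrix approach could look like this (it assumes the numeric values have already been read into @rows, an array of array refs with one row per identifier; the variable names are mine, not from the PDL documentation):

use PDL;

my $x   = pdl( \@rows );                  # dims: (n_columns, n_identifiers)
my $mu  = $x->average;                    # mean of every row
my $c   = $x - $mu->dummy(0);             # center each row once
my $ss  = ( $c * $c )->sumover;           # ssxx for every identifier
my $num = $c x $c->transpose;             # matrix of ssxy values
my $cor = $num / sqrt( $ss->outer($ss) ); # full correlation matrix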
Upvotes: 0
Reputation: 902
Have you searched CPAN? The function gsl_stats_correlation computes the Pearson correlation coefficient; it lives in Math::GSL::Statistics, a module that binds to the GNU Scientific Library.
gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n.
r = \frac{\operatorname{cov}(x, y)}{\hat\sigma_x \hat\sigma_y} = \frac{\frac{1}{n-1} \sum (x_i - \hat{x})(y_i - \hat{y})}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat{x})^2}\,\sqrt{\frac{1}{n-1} \sum (y_i - \hat{y})^2}}
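A rough usage sketch (the %rows hash holding an array ref of values per identifier, and the pair loop, are my assumptions; only the gsl_stats_correlation call itself comes from the module's documentation):

use Math::GSL::Statistics qw/:all/;

my @ids = sort keys %rows;
for my $i ( 0 .. $#ids ) {
    for my $j ( $i + 1 .. $#ids ) {
        my ( $x, $y ) = ( $rows{ $ids[$i] }, $rows{ $ids[$j] } );
        my $r = gsl_stats_correlation( $x, 1, $y, 1, scalar @$x );
        # ... store $r for the pair ($ids[$i], $ids[$j]) ...
    }
}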
Upvotes: 0