Reputation: 323
I compute the Pearson correlation (against the average user/item rating) many times, and the performance of my current code is very bad:
public double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0;
    double sumDenom = 0;
    double denomX = 0;
    double denomY = 0;

    for (int a = 0; a < x.Length; a++)
    {
        sumNum += (x[a] - meanX[a]) * (y[a] - meanY[a]);
        denomX += Math.Pow(x[a] - meanX[a], 2);
        denomY += Math.Pow(y[a] - meanY[a], 2);
    }

    var sqrtDenomX = Math.Sqrt(denomX);
    var sqrtDenomY = Math.Sqrt(denomY);

    if (sqrtDenomX == 0 || sqrtDenomY == 0)
        return 0;

    sumDenom = Math.Sqrt(denomX) * Math.Sqrt(denomY);

    var correlation = sumNum / sumDenom;
    return correlation;
}
I normally use the standard Pearson correlation from MathNet.Numerics, but this is a modification of the standard formula, so the library cannot be used here. Is there a way to speed it up? How can it be optimized with regard to time complexity?
Upvotes: 2
Views: 105
Reputation: 2916
Adding to the MSE answer: changing Pow(x, 2) to diff * diff is definitely something you want to do. You may also want to avoid the unnecessary bounds checking in your innermost loop, which can be done with pointers in C#. It could look like this:
public unsafe double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0;
    double sumDenom = 0;
    double denomX = 0;
    double denomY = 0;
    double diffX;
    double diffY;
    int len = x.Length;

    fixed (double* xptr = &x[0], yptr = &y[0], meanXptr = &meanX[0], meanYptr = &meanY[0])
    {
        for (int a = 0; a < len; a++)
        {
            diffX = (xptr[a] - meanXptr[a]);
            diffY = (yptr[a] - meanYptr[a]);
            sumNum += diffX * diffY;
            denomX += diffX * diffX;
            denomY += diffY * diffY;
        }
    }

    var sqrtDenomX = Math.Sqrt(denomX);
    var sqrtDenomY = Math.Sqrt(denomY);

    if (sqrtDenomX == 0 || sqrtDenomY == 0)
        return 0;

    sumDenom = sqrtDenomX * sqrtDenomY;

    var correlation = sumNum / sumDenom;
    return correlation;
}
Upvotes: 2
Reputation: 345
The only possible optimizations I see in your code are the ones in the following version. If you are still looking for better performance, you may want to use SIMD vectorization (a sketch follows the code below); it lets you use the full computational power of the CPU.
public double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0;
    double sumDenom = 0;
    double denomX = 0;
    double denomY = 0;
    double diffX;
    double diffY;

    for (int a = 0; a < x.Length; a++)
    {
        diffX = (x[a] - meanX[a]);
        diffY = (y[a] - meanY[a]);
        sumNum += diffX * diffY;
        denomX += diffX * diffX;
        denomY += diffY * diffY;
    }

    var sqrtDenomX = Math.Sqrt(denomX);
    var sqrtDenomY = Math.Sqrt(denomY);

    if (sqrtDenomX == 0 || sqrtDenomY == 0)
        return 0;

    sumDenom = sqrtDenomX * sqrtDenomY;

    var correlation = sumNum / sumDenom;
    return correlation;
}
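A minimal sketch of what the SIMD route could look like with System.Numerics.Vector&lt;double&gt; (the method name, the manual tail loop, and the horizontal-sum trick are just illustrative, not part of your original code; it also assumes using System.Numerics is in scope):

using System;
using System.Numerics;

public double ComputeCorrelationSimd(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    var vNum = Vector<double>.Zero;
    var vDenX = Vector<double>.Zero;
    var vDenY = Vector<double>.Zero;
    int width = Vector<double>.Count;
    int a = 0;

    // Vectorized main loop: subtract, multiply and accumulate `width` lanes at once.
    for (; a <= x.Length - width; a += width)
    {
        var diffX = new Vector<double>(x, a) - new Vector<double>(meanX, a);
        var diffY = new Vector<double>(y, a) - new Vector<double>(meanY, a);
        vNum += diffX * diffY;
        vDenX += diffX * diffX;
        vDenY += diffY * diffY;
    }

    // Horizontal sums of the per-lane accumulators.
    double sumNum = Vector.Dot(vNum, Vector<double>.One);
    double denomX = Vector.Dot(vDenX, Vector<double>.One);
    double denomY = Vector.Dot(vDenY, Vector<double>.One);

    // Scalar tail loop for the leftover elements.
    for (; a < x.Length; a++)
    {
        double diffX = x[a] - meanX[a];
        double diffY = y[a] - meanY[a];
        sumNum += diffX * diffY;
        denomX += diffX * diffX;
        denomY += diffY * diffY;
    }

    var sqrtDenomX = Math.Sqrt(denomX);
    var sqrtDenomY = Math.Sqrt(denomY);

    if (sqrtDenomX == 0 || sqrtDenomY == 0)
        return 0;

    return sumNum / (sqrtDenomX * sqrtDenomY);
}

How much this helps depends on the vector width of your hardware (Vector<double>.Count is typically 2 or 4), so measure it against the scalar version on your own data.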
Upvotes: 1
Reputation: 7656
The best way to solve your performance problems is probably to avoid computing as many correlations, if possible. If you are using the correlations as part of another computation, it may be possible to use math to remove the need for some of them.
You should also consider whether you will be able to use the square of the Pearson correlation instead of the Pearson correlation itself. That way, you can save your calls to Math.Sqrt(), which are usually quite expensive. If you do need to take the square root, you should reuse sqrtDenomX and sqrtDenomY rather than recompute the square roots.
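As a sketch of the squared-correlation idea (the method name is just for illustration; note that the squared value loses the sign of the correlation, so it only works where the sign does not matter, e.g. ranking by strength):

public double ComputeSquaredCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
{
    if (x.Length != y.Length)
        throw new ArgumentException("values must be the same length");

    double sumNum = 0;
    double denomX = 0;
    double denomY = 0;

    for (int a = 0; a < x.Length; a++)
    {
        double diffX = x[a] - meanX[a];
        double diffY = y[a] - meanY[a];
        sumNum += diffX * diffY;
        denomX += diffX * diffX;
        denomY += diffY * diffY;
    }

    double denom = denomX * denomY;
    if (denom == 0)
        return 0;

    // r^2 = sumNum^2 / (denomX * denomY), so no Math.Sqrt is needed at all.
    return (sumNum * sumNum) / denom;
}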
Upvotes: 1