netdis
netdis

Reputation: 323

Performance array multiplication Pearson

I compute Pearson correlation (average user/item rating) many times, using my current code performance is very bad:

public double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
        {
            if (x.Length != y.Length)
                throw new ArgumentException("values must be the same length");

            double sumNum = 0;
            double sumDenom = 0;
            double denomX = 0;
            double denomY = 0;

            for (int a = 0; a < x.Length; a++)
            {
                sumNum += (x[a] - meanX[a]) * (y[a] - meanY[a]);
                denomX += Math.Pow(x[a] - meanX[a], 2);
                denomY += Math.Pow(y[a] - meanY[a], 2);
            }

            var sqrtDenomX = Math.Sqrt(denomX);
            var sqrtDenomY = Math.Sqrt(denomY);

            if (sqrtDenomX == 0 || sqrtDenomY == 0) return 0;

            sumDenom = Math.Sqrt(denomX) * Math.Sqrt(denomY);

            var correlation = sumNum / sumDenom;

            return correlation;
        }

I am using standard Pearson correlation with MathNet.Numerics, but this is modification of standard and it's not possible to use it. Is there a way to speed it up? How can it be optimizied regarding to time complexity?

Upvotes: 2

Views: 105

Answers (3)

Florent DUGUET
Florent DUGUET

Reputation: 2916

Adding some on MSE answer -- changing Pow(x,2) to diff*diff is definitely something you want to do, you may also want to avoid unnecessary bound-checking in your inner-most loop. This may be done using pointers in C#.

Could be done this way:

    public unsafe double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("values must be the same length");

        double sumNum = 0;
        double sumDenom = 0;
        double denomX = 0;
        double denomY = 0;
        double diffX;
        double diffY;

        int len = x.Length;

        fixed (double* xptr = &x[0], yptr = &y[0], meanXptr = &meanX[0], meanYptr = &meanY[0])
        {
            for (int a = 0; a < len; a++)
            {
                diffX = (xptr[a] - meanXptr[a]);
                diffY = (yptr[a] - meanYptr[a]);
                sumNum += diffX * diffY;
                denomX += diffX * diffX;
                denomY += diffY * diffY;
            }
        }

        var sqrtDenomX = Math.Sqrt(denomX);
        var sqrtDenomY = Math.Sqrt(denomY);

        if (sqrtDenomX == 0 || sqrtDenomY == 0) return 0;

        sumDenom = sqrtDenomX * sqrtDenomY;

        var correlation = sumNum / sumDenom;

        return correlation;
    }

Upvotes: 2

MSE
MSE

Reputation: 345

The only possible optimizations that I see in your code are in the following code, if you are still looking for better performance then you may want use SIMD vectorization. It will allow you to use the full computation power of the CPU

public double ComputeCorrelation(double[] x, double[] y, double[] meanX, double[] meanY)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("values must be the same length");

        double sumNum = 0;
        double sumDenom = 0;
        double denomX = 0;
        double denomY = 0;
        double diffX;
        double diffY;

        for (int a = 0; a < x.Length; a++)
        {
            diffX = (x[a] - meanX[a]);
            diffY = (y[a] - meanY[a]);
            sumNum += diffX * diffY;
            denomX += diffX * diffX;
            denomY += diffY * diffY;
        }

        var sqrtDenomX = Math.Sqrt(denomX);
        var sqrtDenomY = Math.Sqrt(denomY);

        if (sqrtDenomX == 0 || sqrtDenomY == 0) return 0;

        sumDenom = sqrtDenomX * sqrtDenomY;

        var correlation = sumNum / sumDenom;

        return correlation;
    }

Upvotes: 1

J&#248;rgen Fogh
J&#248;rgen Fogh

Reputation: 7656

The best way to solve your performance problems is probably to avoid computing as many correlations, if possible. If you are using the correlations as part of another computation, it may be possible to use math to remove the need for some of them.

You should also consider whether you will be able to use the square of the Pearson correlation instead of the Pearson correlation itself. That way, you can save your calls to Math.Sqrt(), which are usually quite expensive.

If you do need to take the square root, you should use sqrtDenomX and sqrtDenomY again, rather than recompute the square roots.

Upvotes: 1

Related Questions