AweSIM

Reputation: 1703

multiple dimension correlation in c#

i've got two N-dimensional series of points, each of length M.. the objective is to correlate them and calculate the correlation coefficient.. using the formulae for variance, covariance and standard deviation, it is possible to calculate the correlation coefficient..

what i don't understand is how to adapt the algorithm to account for all N dimensions instead of just one.. consider the following..

series A = [0, 0] [1, 1] [2, 2] [3, 3]  
series B = [0,  0] [1, -1] [2, -2] [3, -3]

if we use only the first dimension for correlation, we'll get +1.00.. if we use the second, we'll get -1.00.. but we can see that if we were to consider both dimensions for correlation, the answer won't be as simple as +1.00 or -1.00..

so i wanna know how to formulate this sort of multiple-dimension correlation, preferably in c#..

feel free to ask for further clarifications or edit to improve the post further.. =)

EDIT: the series i'm using are stock time series.. i retrieve the latest M samples of CLOSE prices as series A and start correlating it with all historic data as a sliding window (data[1] to data[M+1], data[2] to data[M+2], data[1000] to data[M+1000], and so on).. the offset where the correlation is highest is the point in time where the price behaviour was almost identical to now.. by analyzing whether the price moved up or down after that time instant, we can make a prediction about which way the price might move at this time instant.. but i'm not using just CLOSE prices (1-dimension).. i want to identify regions where a number of metrics were similar, for instance CLOSE, VOLUME, etc.. so the time series doesn't have just one value for every index but a whole array of values..

if i use just CLOSE in correlation, i can't guarantee that the VOLUME sequence of these series will be similar too.. likewise if i use VOLUME in correlation, i can't guarantee that the CLOSE sequence of these series will be similar too.. so i need a formula for normalized correlation which is based on some sort of distance metric.. something like a^2 + b^2.. if the CLOSE values are similar, a^2 will be small.. if the VOLUME values are similar, b^2 will be small.. now if a^2 + b^2 is small, it means both CLOSE and VOLUME are similar..
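A minimal sketch of the a^2 + b^2 idea, not a definitive implementation. It assumes each window is an M x N `double[,]` (rows are time steps, columns are metrics such as CLOSE and VOLUME) and that each column is z-normalized inside its own window so that prices and volumes become comparable. The names `WindowDistance`, `ZScores` and `MeanSquaredDistance` are made up for illustration.

```csharp
using System;

public static class WindowDistance
{
    // Z-normalizes one column (dimension) of a window: subtract the mean,
    // divide by the population standard deviation.
    // Assumption: the column is not constant; a zero std dev gives NaN.
    static double[] ZScores(double[,] window, int dim)
    {
        int m = window.GetLength(0);
        double mean = 0;
        for (int i = 0; i < m; i++) mean += window[i, dim];
        mean /= m;

        double variance = 0;
        for (int i = 0; i < m; i++)
            variance += (window[i, dim] - mean) * (window[i, dim] - mean);
        double stdDev = Math.Sqrt(variance / m);

        var z = new double[m];
        for (int i = 0; i < m; i++) z[i] = (window[i, dim] - mean) / stdDev;
        return z;
    }

    // Mean squared difference between two M x N windows after per-column
    // z-normalization. 0 means the same shape in every dimension;
    // larger values mean less similar.
    public static double MeanSquaredDistance(double[,] a, double[,] b)
    {
        int m = a.GetLength(0), n = a.GetLength(1);
        double sum = 0;
        for (int d = 0; d < n; d++)
        {
            double[] za = ZScores(a, d), zb = ZScores(b, d);
            for (int i = 0; i < m; i++)
                sum += (za[i] - zb[i]) * (za[i] - zb[i]);
        }
        return sum / (m * n);
    }
}
```

With population z-scores, each dimension contributes 2 * (1 - r) to the average, where r is that dimension's Pearson correlation. For series A and B above, the first dimension contributes 0 and the second contributes 4, so `MeanSquaredDistance` comes out at about 2.0.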

previously what i was doing was as follows:
1. use CLOSE prices to calculate correlation.
2. use VOLUME to calculate correlation.
3. multiply these values together.. this will ensure that high correlation values will imply that both CLOSE and VOLUME have strong individual correlations..

EDIT:

stdDevX = Sqrt (Summation ((x - Mean(x)) * (x - Mean(x))) / N)  
stdDevY = Sqrt (Summation ((y - Mean(y)) * (y - Mean(y))) / N)  
corrXY = Summation ((x - Mean(x)) * (y - Mean(y))) / (N * stdDevX * stdDevY)  

http://en.wikipedia.org/wiki/Standard_deviation
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

the above formulae assume that both series x and y are one-dimensional.. my main concern is how to adapt these formulae for multi-dimensional vectors.. i wish to use it to find regions in history where all price metrics are similar.. but it can be used by anyone who wishes to correlate any sort of vectors.. x,y,z coordinates of an object, etc..

Upvotes: 1

Views: 2912

Answers (2)

Matthew Strawbridge

Reputation: 20630

It's not clear from the question, but I think you are being asked to treat each series separately. So considering just Series A as a sequence of samples from a pair of variables X and Y, the two variables are completely tied (if you drew a scatter plot, all the values would be on a straight line from bottom-left to top-right) so the correlation is +1.

In contrast, considering just Series B as another sequence of samples from X and Y, the scatter plot would again be a straight line, but this time from top-left to bottom-right. Increasing X decreases Y. The correlation is -1.

It gets more interesting if each series contains samples from three variables (for example, snapshots of the prices of three stocks over time). Here is a simple example:

            X  Y  Z   X  Y   Z   X  Y   Z   X  Y   Z
series C = [0, 0, 0] [1, 1, -1] [2, 2, -2] [3, 3, -3]

Here, you need to consider the correlation between each pair of variables. In this simple case, the correlation between X and Y is +1, between X and Z is -1 and between Y and Z is -1.
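As a rough sketch of the pairwise calculation, assuming the series is stored as an M x N `double[,]` with one row per sample and one column per variable (the class and method names are made up for illustration):

```csharp
using System;

public static class PairwiseCorrelation
{
    // Pearson correlation between columns p and q of an M x N sample matrix.
    // Returns NaN when either column is constant (zero variance).
    public static double Correlate(double[,] samples, int p, int q)
    {
        int m = samples.GetLength(0);

        double meanP = 0, meanQ = 0;
        for (int i = 0; i < m; i++)
        {
            meanP += samples[i, p];
            meanQ += samples[i, q];
        }
        meanP /= m;
        meanQ /= m;

        double cov = 0, varP = 0, varQ = 0;
        for (int i = 0; i < m; i++)
        {
            double dp = samples[i, p] - meanP;
            double dq = samples[i, q] - meanQ;
            cov += dp * dq;
            varP += dp * dp;
            varQ += dq * dq;
        }

        // The 1/M factors of covariance and variance cancel, so they are omitted.
        return cov / Math.Sqrt(varP * varQ);
    }
}
```

For series C stored as `{ {0,0,0}, {1,1,-1}, {2,2,-2}, {3,3,-3} }`, `Correlate(c, 0, 1)` gives +1, while `Correlate(c, 0, 2)` and `Correlate(c, 1, 2)` give -1.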

Edit: Combining correlations

Suppose you have samples from three variables – close, high and low – for two time periods and want to know how good a match the two periods are. You could calculate the correlations between the two time periods for each variable in the traditional way. Suppose this yields close-correlation = 0.6, high-correlation = 0.3, and low-correlation = 0.4.

You need some method of combining the individual correlations into a goodness of fit score in such a way that individual correlations far from zero (i.e. highly correlated, either positively or negatively) have a bigger contribution to the score than those close to zero. Simple approaches include taking the product (0.6 * 0.3 * 0.4 = 0.072) or the root-mean-square (sqrt((0.6^2 + 0.3^2 + 0.4^2) / 3) = 0.4509) – you'll have to experiment to find the method that gives you the most reliable results.
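The two combining rules could be sketched like this, assuming the per-variable correlations have already been calculated and are passed in as an array (the class and method names are made up for illustration):

```csharp
using System;
using System.Linq;

public static class CombinedCorrelation
{
    // Product of the individual correlations. The sign is kept, so an odd
    // number of negative correlations gives a negative score.
    public static double Product(double[] correlations) =>
        correlations.Aggregate(1.0, (acc, r) => acc * r);

    // Root-mean-square of the individual correlations. Squaring discards
    // the sign, so -1 and +1 contribute equally.
    public static double RootMeanSquare(double[] correlations) =>
        Math.Sqrt(correlations.Sum(r => r * r) / correlations.Length);
}
```

For the values above, `Product(new[] { 0.6, 0.3, 0.4 })` is about 0.072 and `RootMeanSquare` is about 0.4509. The two rules treat sign differently: for correlations of +1 and -1 the product is -1 but the root-mean-square is 1, so the choice depends on whether an inverted match should count as a match.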

Upvotes: 2

Mharlin

Reputation: 1705

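// Scores two 2-D series: +1 for each point where both dimensions match
// within a tolerance of 0.5, -1 otherwise.
int GetCorrelationScore(decimal[,] seriesA, decimal[,] seriesB)
{
   int correlationScore = 0;

   // GetLength(0) is the number of points; Length would count every element.
   for (var i = 0; i < seriesA.GetLength(0); i++)
   {
      if (areEqual(seriesA[i, 0], seriesB[i, 0], 0.5m) && areEqual(seriesA[i, 1], seriesB[i, 1], 0.5m))
         correlationScore++;
      else
         correlationScore--;
   }

   return correlationScore;
}

// True when value2 lies strictly within allowedVariance of value1.
bool areEqual(decimal value1, decimal value2, decimal allowedVariance)
{
   var lowValue1 = value1 - allowedVariance;
   var highValue1 = value1 + allowedVariance;

   return (lowValue1 < value2 && highValue1 > value2);
}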

Upvotes: 0
