Reputation: 365
I am trying to calculate the correlation among three columns in a dataset. The dataset is relatively large (4 GB in size). When I calculate the correlation among the columns of interest, I get small values like 0.0024, -0.0067, etc. I am not sure whether this result makes sense. Should I sample the data and then try calculating the correlation? Any thoughts/experience on this topic would be appreciated.
Upvotes: 0
Views: 175
Reputation: 77474
There is nothing special about correlation of large data sets. All you need to do is some simple aggregation.
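As a rough sketch of what "simple aggregation" could look like: one pass over the data, keeping only six running totals, is enough to get a Pearson correlation without loading 4 GB into memory. The function name `streaming_pearson` and the idea of feeding it an iterable of `(x, y)` pairs are assumptions for illustration, not something from the question. Note that this plain sum-of-products form can lose precision when the means are large relative to the spread.

```python
import math


def streaming_pearson(pairs):
    """One-pass Pearson correlation from running sums (hypothetical helper).

    `pairs` is any iterable of (x, y) numbers, e.g. rows read lazily from a file.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y

    # Scaled covariance and variances; the common factor of n^2 cancels in the ratio.
    cov = n * sum_xy - sum_x * sum_y
    var_x = n * sum_xx - sum_x * sum_x
    var_y = n * sum_yy - sum_y * sum_y
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for a constant (or empty) column")
    return cov / math.sqrt(var_x * var_y)
```

For three columns you would run this once per pair of columns, or extend the accumulators to cover all three pairs in the same pass.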
If you want to improve your numerical precision (remember that floating point math is lossy) you can use Kahan summation and similar techniques, in particular for values close to 0.
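For reference, a minimal sketch of Kahan (compensated) summation, which carries a small correction term alongside the running total to recover low-order bits that plain floating-point addition drops. The name `kahan_sum` is just an illustrative choice.

```python
def kahan_sum(values):
    """Compensated summation of an iterable of floats (Kahan's algorithm)."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for value in values:
        corrected = value - compensation
        new_total = total + corrected
        # (new_total - total) is what was actually added; subtracting
        # `corrected` leaves the rounding error of this step.
        compensation = (new_total - total) - corrected
        total = new_total
    return total
```

In Python specifically, `math.fsum` from the standard library also gives an accurately rounded sum, so it may be the simpler option if the values fit in an iterable.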
But maybe your data just doesn't have strong correlation?
Try visualizing a sample!
Upvotes: 0
Reputation: 3264
Firstly, make sure you're applying the right formula for correlation. Remember, given vectors x and y, correlation is ((x - mean(x)) * (y - mean(y))) / (length(x - mean(x)) * length(y - mean(y))), where * represents the dot product and length(v) is the square root of the sum of the squares of the terms in v. Note that the lengths in the denominator are taken of the mean-centered vectors, not of the raw x and y. (I know that's silly, but noticing a mis-typed formula is a lot easier than redoing a program.)
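To make the formula easy to check against your own program, here is a small sketch that follows it step by step: center both vectors, take their dot product, and divide by the product of their lengths. The function name `correlation` and the use of plain Python lists are assumptions for illustration.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")

    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    # Mean-centered vectors.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]

    dot = sum(a * b for a, b in zip(dx, dy))
    length_dx = math.sqrt(sum(a * a for a in dx))
    length_dy = math.sqrt(sum(b * b for b in dy))
    if length_dx == 0 or length_dy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return dot / (length_dx * length_dy)
```

Running it on a handful of rows where you can work out the answer by hand is a quick way to confirm that your large-scale computation uses the same definition.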
Do you have a strong hunch that there should be some correlation among these columns? If you don't, then those small values are reasonable. On the other hand, if you're pretty sure that there ought to be a strong correlation, then try sampling a random 100 pairs and either finding the correlation there, or plotting them for visual inspection, which can also show you if there is correlation present.
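Since the file is too large to shuffle in memory, one way to draw the random 100 pairs is reservoir sampling, which picks a uniform sample in a single pass without knowing the row count up front. This is a sketch under assumptions: the name `reservoir_sample`, the `seed` parameter, and the idea that rows arrive as an iterable of `(x, y)` pairs are illustrative choices, not part of the original suggestion.

```python
import random


def reservoir_sample(rows, k=100, seed=None):
    """Uniform random sample of up to k items from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```

The resulting list is small enough to pass to any correlation routine or to a scatter plot for visual inspection.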
Upvotes: 1