Reputation: 5354
I have a data set that contains numeric values, and I'd like to measure the correlation between its columns.
Consider:
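import numpy as np
import pandas as pd

# four independent random columns on very different scales
dataset = pd.DataFrame({'A': np.random.rand(100) * 1000,
                        'B': np.random.rand(100) * 100,
                        'C': np.random.rand(100) * 10,
                        't': np.random.rand(100)})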
Mathematically, uncorrelated data means that cov(a,b) = 0. With real data, the sample covariance will not be exactly zero, only close to it.
np.cov(a,b)
This NumPy call returns the 2x2 covariance matrix of the two variables, whose off-diagonal entry is the covariance between them. But I'd like to make sure that my dataset is not correlated. Is there any trick to do that?
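One caveat with checking the raw covariance: it depends on the units of each column, so with columns scaled by 1000 and 100 a "near zero" value is hard to judge. A minimal sketch of the same check using the Pearson correlation matrix, which is scale-free, is given here. The fixed seed, the `threshold` value, and the variable names are assumptions for illustration only; `dataset.corr()` is the pandas equivalent of the `np.corrcoef` call.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
# same scales as columns A, B, C, t in the question
data = rng.random((100, 4)) * np.array([1000, 100, 10, 1])

# Pearson correlation matrix; rowvar=False treats columns as variables
corr = np.corrcoef(data, rowvar=False)

# largest absolute correlation between two different columns
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
max_abs_corr = np.abs(off_diag).max()

threshold = 0.3  # hypothetical cutoff, pick one that suits your data
print(corr.round(3))
print("max |r| =", max_abs_corr, "below threshold:", max_abs_corr < threshold)
```

With only 100 samples, independent columns will still show correlations that are small but not zero, so the cutoff is a judgment call rather than a fixed rule.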
UPDATE
from matplotlib.mlab import PCA
results = PCA(dataset.values)
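As far as I know, matplotlib.mlab.PCA has been removed from recent Matplotlib releases, so the snippet above may not run on a current install. A rough sketch of the same idea with plain NumPy follows: project the centered data onto the eigenvectors of its covariance matrix and check that the projected columns are uncorrelated. The seed and variable names are assumptions, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
data = rng.random((100, 4)) * np.array([1000, 100, 10, 1])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices and returns ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by explained variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = centered @ eigvecs                # data expressed in the principal axes
scores_cov = np.cov(scores, rowvar=False)  # diagonal up to rounding error
print(scores_cov.round(6))
```

The off-diagonal entries of scores_cov come out as zero up to floating-point error, which is the sense in which PCA decorrelates the data.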
Upvotes: 2
Views: 4358
Reputation: 1675
I have a covariance code snippet that I refer to:
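import numpy as np
# matrix: data array of shape (nsamples, nfeatures)
nsamples = matrix.shape[0]
mean = np.mean(matrix, axis=0)
# make a mean matrix the same shape as data for subtraction
mean_mat = np.outer(np.ones(nsamples), mean)
centered = matrix - mean_mat
# unbiased sample covariance, shape (nfeatures, nfeatures)
cov = np.dot(centered.T, centered) / (nsamples - 1)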
Here cov is the NumPy array and mean is the mean in the row direction (one value per column).
Note the matrix doesn't need to be square.
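As a sanity check, the manual computation can be compared against np.cov. This is a small sketch with made-up data (the seed and the 100x3 shape are assumptions); it uses broadcasting instead of building the mean matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)         # fixed seed, assumed only for reproducibility
matrix = rng.random((100, 3))          # hypothetical data: 100 samples, 3 features
nsamples = matrix.shape[0]

mean = matrix.mean(axis=0)             # one mean per column
centered = matrix - mean               # broadcasting replaces the explicit mean matrix
cov = centered.T @ centered / (nsamples - 1)

# np.cov also divides by (n - 1) by default; rowvar=False means columns are variables
cov_np = np.cov(matrix, rowvar=False)
print(np.allclose(cov, cov_np))
```

The non-square 100x3 input still produces a square 3x3 covariance matrix, one row and column per feature.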
Then you can use the covariance matrix to "take out the variance" by multiplying the data by the inverse covariance, computed here as the Moore-Penrose pseudoinverse:
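# np.linalg.svd returns cov = U . diag(S) . Vt
U, S, Vt = np.linalg.svd(cov)
# invert the singular values (assumes none of them is zero)
D = np.diag(1. / S)
# pseudoinverse = Vt.T . diag(1/S) . U.T
inverse_cov = np.dot(Vt.T, np.dot(D, U.T))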
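Two things may be worth checking here, sketched below with made-up data (seed, shape, and names such as inv_sqrt_cov are assumptions). First, the SVD result should agree with np.linalg.pinv. Second, if the goal is data whose covariance is the identity, the usual whitening step multiplies by the inverse square root of the covariance rather than the inverse itself, since multiplying by the full inverse leaves a covariance equal to the inverse covariance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for reproducibility
matrix = rng.random((100, 4)) * np.array([1000, 100, 10, 1])
centered = matrix - matrix.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# pseudoinverse from the SVD, as in the snippet above
U, S, Vt = np.linalg.svd(cov)
inverse_cov = Vt.T @ np.diag(1.0 / S) @ U.T
pinv_cov = np.linalg.pinv(cov)  # NumPy's own Moore-Penrose pseudoinverse

# inverse square root of cov; valid because cov is symmetric positive definite here
inv_sqrt_cov = U @ np.diag(1.0 / np.sqrt(S)) @ U.T
whitened = centered @ inv_sqrt_cov
whitened_cov = np.cov(whitened, rowvar=False)  # close to the identity matrix

print(np.allclose(inverse_cov, pinv_cov, atol=1e-6))
print(whitened_cov.round(6))
```

If some singular values are zero or very small, np.linalg.pinv is the safer choice, because it discards them instead of dividing by them.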
Upvotes: 2