Reputation: 171
I'm trying to emulate the np.cov function by implementing the covariance matrix from scratch. However, my code doesn't give the same output as np.cov.
Code:
import math
import pandas as pd
import numpy as np
df = pd.read_csv('C:/Users/User/Downloads/Admission_Predict.csv')
X = df.values
N, M = X.shape
means = np.zeros(M) # M many of them
stdevs = np.zeros(M)
Xcoeff = np.zeros((M, M))
# Mean
for i in range(M):
    means[i] = np.sum(X[:, i]) / N
    stdevs[i] = math.sqrt(sum(pow(x-means[i], 2) for x in X[:, i]) / (N-1))
# Covariance matrix
for j in range(M):
    mat0 = mat[i][j] - [means][0].reshape(M, -1)
    covariance = (mat0 * mat0.T) / (N-1)
Desired matrix values
print(np.cov(df))
> [[14128.00654107 13533.16488393 13222.07435357 ... 13831.92691786
> 13050.78170893 13961.07189821] [13533.16488393 12968.32105536 12670.19783929 ... 13249.25808929
> 12505.84390893 13372.93946964] [13222.07435357 12670.19783929 12379.07033571 ... 12944.65915
> 12218.34000357 13065.526925 ] ... [13831.92691786 13249.25808929 12944.65915 ... 13542.10545
> 12777.00158214 13668.54191786] [13050.78170893 12505.84390893 12218.34000357 ... 12777.00158214
> 12060.0142125 12896.28555179] [13961.07189821 13372.93946964 13065.526925 ... 13668.54191786
> 12896.28555179 13796.19808393]]
My output matrix values
print(covariance)
> [ 3.47493270e+02 1.17319616e+02 2.64636910e+00 2.98987496e+00
> 3.04758394e+00 8.70463072e+00 -1.45646482e-01 4.87503509e-02]
Upvotes: 0
Views: 4746
Reputation: 8654
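See the code below. Note that you need to set rowvar=False in np.cov in order to calculate the covariances between the data frame columns. By default, np.cov treats each row as a variable, so np.cov(df) returns an N x N matrix rather than the M x M matrix of column covariances.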
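Before the full loop version, here is a quick sketch of what rowvar changes. It uses a small synthetic array (made up for illustration, not the Admission_Predict data), so only the shapes matter here:

```python
import numpy as np

# Synthetic stand-in data: 5 observations (rows) of 3 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

cov_rows = np.cov(X)                # default rowvar=True: each row is a variable
cov_cols = np.cov(X, rowvar=False)  # each column is a variable

print(cov_rows.shape)  # (5, 5)
print(cov_cols.shape)  # (3, 3)

# Transposing the input gives the same result as rowvar=False.
print(np.allclose(cov_cols, np.cov(X.T)))  # True
```

The loop implementation that follows targets the second case, one row and one column of the result per data frame column.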
import pandas as pd
import numpy as np
# Load the data
df = pd.read_csv('Admission_Predict.csv')
# Extract the data
X = df.values
# Extract the number of rows and columns
N, M = X.shape
# Calculate the covariance matrix
cov = np.zeros((M, M))
for i in range(M):
    # Mean of column "i"
    mean_i = np.sum(X[:, i]) / N
    for j in range(M):
        # Mean of column "j"
        mean_j = np.sum(X[:, j]) / N
        # Covariance between column "i" and column "j"
        cov[i, j] = np.sum((X[:, i] - mean_i) * (X[:, j] - mean_j)) / (N - 1)
# Compare with numpy covariance matrix
np_cov = np.cov(df.values, rowvar=False)
np_cov_diff = np.sum(np.abs(cov - np_cov))
print('Difference from numpy cov. mat.: {:.12f}'.format(np_cov_diff))
# Difference from numpy cov. mat.: 0.000000000000
# Compare with pandas covariance matrix
pd_cov = df.cov().values
pd_cov_diff = np.sum(np.abs(cov - pd_cov))
print('Difference from pandas cov. mat.: {:.12f}'.format(pd_cov_diff))
# Difference from pandas cov. mat.: 0.000000000000
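The double loop above can also be written without explicit loops. The following is a minimal sketch under the same convention (columns are variables, division by N - 1). The helper name cov_columns and the random test data are assumptions made for illustration, not part of the original code:

```python
import numpy as np

def cov_columns(X):
    """Sample covariance between the columns of X (divides by N - 1)."""
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    centered = X - X.mean(axis=0)  # subtract each column's mean
    # Matrix product (@), not element-wise *, sums over the observations.
    return centered.T @ centered / (n_rows - 1)

# Synthetic stand-in data: 50 observations of 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

C = cov_columns(X)
print(C.shape)                                   # (4, 4)
print(np.allclose(C, np.cov(X, rowvar=False)))   # True
```

For NumPy arrays, * multiplies element by element, so an expression like mat0 * mat0.T does not perform the sum over observations that a covariance needs; the @ operator (or np.dot) does.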
Upvotes: 1