melololo

Reputation: 171

Implement numpy covariance matrix from scratch

I'm trying to emulate the np.cov function by implementing a covariance matrix from scratch. However, my code doesn't give the same output as np.cov.

Code:

import pandas as pd
import numpy as np

df = pd.read_csv('C:/Users/User/Downloads/Admission_Predict.csv')
X = df.values
N, M = X.shape

means = np.zeros(M)  # M many of them
stdevs = np.zeros(M)
Xcoeff = np.zeros((M, M))

# Mean
for i in range(M):
    means[i] = np.sum(X[:, i]) / N
    stdevs[i] = math.sqrt(sum(pow(x-means[i], 2) for x in X[:, i]) / (N-1))

    # Covariance matrix
    for j in range(M):
        mat0 = mat[i][j] - [means][0].reshape(M, -1)
        covariance = (mat0 * mat0.T) / (N-1)

Desired matrix values

print(np.cov(df))

> [[14128.00654107 13533.16488393 13222.07435357 ... 13831.92691786
>   13050.78170893 13961.07189821]
>  [13533.16488393 12968.32105536 12670.19783929 ... 13249.25808929
>   12505.84390893 13372.93946964]
>  [13222.07435357 12670.19783929 12379.07033571 ... 12944.65915
>   12218.34000357 13065.526925  ]
>  ...
>  [13831.92691786 13249.25808929 12944.65915    ... 13542.10545
>   12777.00158214 13668.54191786]
>  [13050.78170893 12505.84390893 12218.34000357 ... 12777.00158214
>   12060.0142125  12896.28555179]
>  [13961.07189821 13372.93946964 13065.526925   ... 13668.54191786
>   12896.28555179 13796.19808393]]

My output matrix values

print(covariance)

> [ 3.47493270e+02  1.17319616e+02  2.64636910e+00  2.98987496e+00
>   3.04758394e+00  8.70463072e+00 -1.45646482e-01  4.87503509e-02]

Upvotes: 0

Views: 4746

Answers (1)

user11989081

Reputation: 8654

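See the code below. Note that you need to set rowvar=False in np.cov to calculate the covariances between the data frame columns. By default, np.cov treats each row as a variable and each column as an observation, which is why np.cov(df) returns an N x N matrix rather than the M x M matrix you are after.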

import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('Admission_Predict.csv')

# Extract the data
X = df.values

# Extract the number of rows and columns
N, M = X.shape

# Calculate the covariance matrix
cov = np.zeros((M, M))

for i in range(M):

    # Mean of column "i"
    mean_i = np.sum(X[:, i]) / N

    for j in range(M):

        # Mean of column "j"
        mean_j = np.sum(X[:, j]) / N

        # Covariance between column "i" and column "j"
        cov[i, j] = np.sum((X[:, i] - mean_i) * (X[:, j] - mean_j)) / (N - 1)

# Compare with numpy covariance matrix
np_cov = np.cov(df.values, rowvar=False)
np_cov_diff = np.sum(np.abs(cov - np_cov))
print('Difference from numpy cov. mat.: {:.12f}'.format(np_cov_diff))
# Difference from numpy cov. mat.: 0.000000000000

# Compare with pandas covariance matrix
pd_cov = df.cov().values
pd_cov_diff = np.sum(np.abs(cov - pd_cov))
print('Difference from pandas cov. mat.: {:.12f}'.format(pd_cov_diff))
# Difference from pandas cov. mat.: 0.000000000000
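
For reference, the same calculation can also be written without the explicit loops. The following is only a minimal sketch: the helper name `covariance_matrix` and the synthetic array `X_demo` are made up for illustration (the random values stand in for the Admission_Predict data, which is not reproduced here). It centers each column and forms the matrix product of the centered data with itself, then compares the result with np.cov for both settings of rowvar.

```python
import numpy as np


def covariance_matrix(X):
    """Sample covariance between the columns of a 2-D array (rows are observations).

    Hypothetical helper for illustration; not part of NumPy.
    """
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    # Subtract each column's mean from that column
    centered = X - X.mean(axis=0)
    # (M, N) @ (N, M) -> (M, M); divide by N - 1 as np.cov does by default
    return centered.T @ centered / (n_rows - 1)


# Synthetic stand-in data (an assumption, not the Admission_Predict values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))

cov_demo = covariance_matrix(X_demo)

# Default rowvar=True treats the 50 rows as variables, giving a 50 x 50 result
print(np.cov(X_demo).shape)
# rowvar=False treats the 4 columns as variables, matching the helper
print(cov_demo.shape)
print(np.allclose(cov_demo, np.cov(X_demo, rowvar=False)))
```

The shape comparison shows the same symptom as in the question: without rowvar=False the result has one row and column per observation rather than per variable.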

Upvotes: 1
