Reputation: 1767
I am comparatively new to Python, statistics, and data-science libraries. My requirement is to run a multicollinearity test on a dataset with n columns and drop every column/variable having VIF > 5.
I found this code:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X, thresh=5.0):
    variables = range(X.shape[1])
    tmp = range(X[variables].shape[1])
    print(tmp)
    dropped = True
    while dropped:
        dropped = False
        vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped = True
    print('Remaining variables:')
    print(X.columns[variables])
    return X[variables]
But I do not clearly understand: should I pass the entire dataset as the X argument? If so, it is not working for me.
Please help!
Upvotes: 6
Views: 17694
Reputation: 323
Firstly, thanks to @DanSan for introducing the idea of parallelization into the multicollinearity computation. I now get at least a 50% improvement in computation time on a multi-dimensional dataset of shape (22500, 71). But I faced one interesting challenge on a dataset I was working on. It contains some categorical columns, which I binary-encoded using Category-encoders; as a result, some columns ended up with just one unique value, and for such columns the VIF is non-finite or NaN!
The following snapshot shows the VIF values for some of the 71 binary-encoded columns in my dataset:
In these situations, the number of columns that remain after running the code by @Aakash Basu or @DanSan can depend on the order of the columns in the dataset, as I learned from bitter experience, since columns are dropped one at a time based on the maximum VIF. And a column with just one unique value is useless for any machine-learning model, as it forcibly imposes a bias on the system!
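As a quick illustration of that pre-filtering step, constant columns can be found with pandas' nunique before any VIF is computed. A minimal sketch on a toy DataFrame (the column names are made up for the example):

```python
import pandas as pd

# Toy frame: after encoding, column 'b' ended up with just one unique value
df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [0, 0, 0, 0],
                   'c': [2, 1, 4, 3]})

# Columns with a single unique value have a non-finite VIF, so drop them first
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print(constant_cols)  # ['b']

df = df.drop(columns=constant_cols)
print(list(df.columns))  # ['a', 'c']
```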
To handle this issue, you can use the following updated code:
from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor

def removeMultiColl(data, vif_threshold=5.0):
    # Drop columns with a single unique value first: their VIF is non-finite
    for i in data.columns:
        if data[i].nunique() == 1:
            print(f"Dropping {i} due to just 1 unique value")
            data.drop(columns=i, inplace=True)
    drop = True
    col_list = list(data.columns)
    while drop:
        drop = False
        # Compute all VIFs in parallel across the available CPU cores
        vif_list = Parallel(n_jobs=-1, verbose=5)(
            delayed(variance_inflation_factor)(data[col_list].values, i)
            for i in range(data[col_list].shape[1])
        )
        max_index = vif_list.index(max(vif_list))
        if vif_list[max_index] > vif_threshold:
            print(f"Dropping column : {col_list[max_index]} at index - {max_index}")
            del col_list[max_index]
            drop = True
    print("Remaining columns :\n", list(data[col_list].columns))
    return data[col_list]
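For intuition about what variance_inflation_factor returns: the VIF of column i is 1 / (1 - R²), where R² comes from regressing that column on the remaining columns. A minimal numpy sketch of that definition on synthetic data (this is only an illustration, not the statsmodels implementation):

```python
import numpy as np

# Synthetic design matrix: column 2 is almost exactly col 0 + col 1,
# while column 3 is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
d = rng.normal(size=100)
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=100), d])

def vif(X, i):
    # Regress column i on the remaining columns; VIF = 1 / (1 - R^2)
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

print(vif(X, 2))  # huge: far above the usual threshold of 5
print(vif(X, 3))  # close to 1: the independent column is kept
```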
Best of luck!
Upvotes: 1
Reputation: 1767
I tweaked the code and managed to achieve the desired result with the following, adding a little bit of exception handling:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def multicollinearity_check(X, thresh=5.0):
    data_type = X.dtypes
    # print(type(data_type))
    int_cols = \
        X.select_dtypes(include=['int', 'int16', 'int32', 'int64', 'float', 'float16', 'float32', 'float64']).shape[1]
    total_cols = X.shape[1]
    try:
        if int_cols != total_cols:
            raise Exception('All the columns should be integer or float, for multicollinearity test.')
        else:
            variables = list(range(X.shape[1]))
            dropped = True
            print('''\n\nThe VIF calculator will now iterate through the features and calculate their respective values.
It shall continue dropping the highest VIF features until all the features have VIF less than the threshold of 5.\n\n''')
            while dropped:
                dropped = False
                vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in variables]
                print('\n\nvif is: ', vif)
                maxloc = vif.index(max(vif))
                if max(vif) > thresh:
                    print('dropping \'' + X.iloc[:, variables].columns[maxloc] + '\' at index: ' + str(maxloc))
                    # del variables[maxloc]
                    X.drop(X.columns[variables[maxloc]], axis=1, inplace=True)
                    variables = list(range(X.shape[1]))
                    dropped = True
            print('\n\nRemaining variables:\n')
            print(X.columns[variables])
            # return X.iloc[:, variables]
            return X
    except Exception as e:
        print('Error caught: ', e)
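If your frame mixes numeric and non-numeric columns, one way to satisfy the dtype check above is to pre-select the numeric columns with select_dtypes before calling the function. A small sketch on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3],
                   'y': [0.5, 0.1, 0.9],
                   'label': ['a', 'b', 'c']})

# Keep only numeric columns before running the multicollinearity test
X = df.select_dtypes(include='number')
print(list(X.columns))  # ['x', 'y']
```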
Upvotes: 3
Reputation: 128
I also had issues running something similar. I fixed it by changing how variables was defined and by finding another way of deleting its elements.
The following script should work with Anaconda 5.0.1 and Python 3.6 (the latest version as of this writing).
import numpy as np
import pandas as pd
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor
from joblib import Parallel, delayed

# Defining the function that you will run later
def calculate_vif_(X, thresh=5.0):
    variables = [X.columns[i] for i in range(X.shape[1])]
    dropped = True
    while dropped:
        dropped = False
        print(len(variables))
        # Compute all VIFs in parallel across the available CPU cores
        vif = Parallel(n_jobs=-1, verbose=5)(
            delayed(variance_inflation_factor)(X[variables].values, ix)
            for ix in range(len(variables))
        )
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(time.ctime() + ' dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            variables.pop(maxloc)
            dropped = True
    print('Remaining variables:')
    print(variables)
    return X[variables]

X = df[feature_list]     # Selecting your data
X2 = calculate_vif_(X, 5)  # Actually running the function
If you have many features, it will take a very long time to run, so I made another change to have it work in parallel in case you have multiple CPUs available.
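The parallel call above follows joblib's standard Parallel/delayed pattern: delayed wraps the function, and the generator of wrapped calls is handed to a Parallel instance. A minimal sketch of that pattern on a trivial function (not VIF-specific):

```python
from joblib import Parallel, delayed

def square(i):
    return i * i

# n_jobs=-1 uses all available cores; 2 is enough for this toy example
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
print(results)  # [0, 1, 4, 9, 16]
```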
Enjoy!
Upvotes: 6