Reputation: 149
I am having trouble running a collinearity analysis on a simple DataFrame (see below). Every time I run the function, I get the following error message:
KeyError: "None of [Int64Index([0, 1, 2, 3], dtype='int64')] are in the [columns]"
Below is the code I am using:

    read_training_set = pd.read_csv('C:\\Users\\rapha\\Desktop\\New test\\Classeur1.csv', sep=";")
    training_set = pd.DataFrame(read_training_set)
    print(training_set)

    def calculate_vif_(X):
        thresh = 5.0
        variables = range(X.shape[1])

        for i in np.arange(0, len(variables)):
            vif = [variance_inflation_factor(X[variables].values, ix) for ix in range(X[variables].shape[1])]
            print(vif)

            maxloc = vif.index(max(vif))
            if max(vif) > thresh:
                print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
                del variables[maxloc]

        print('Remaining variables:')
        print(X.columns[variables])
        return X

    X = training_set
    X2 = calculate_vif_(X)

The DataFrame on which I am trying to run my function looks like this:

       Year  Age  Weight  Size
    0  2020   10     100   170
    1  2021   11     101   171
    2  2022   12     102   172
    3  2023   13     103   173
    4  2024   14     104   174
    5  2025   15     105   175
    6  2026   16     106   176
    7  2027   17     107   177
    8  2028   18     108   178

I have two guesses, but I am not sure how to fix either of them:
- Guess 1: `np.arange` is causing some sort of conflict with the header and columns, which prevents the rest of the function from iterating through each column.
- Guess 2: the problem comes from blank separators, which prevent the function from moving from one column to the next properly. My CSV file already has ";" separators (I do not know exactly why, to be honest, as I created the file manually and saved it as a regular CSV with "," separators).
I am not sure how to fix the problem at this point. Does anyone have any insight?
Best
Upvotes: 1
Views: 643
Reputation: 149
Got it, I revised the whole thing and it seems to be working. See below how it looks.
Thanks a lot for the help.
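```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X):
    thresh = 5.0
    # A list, unlike a range, supports `del`.
    variables = list(range(X.shape[1]))
    # Drop the column with the highest VIF until none exceeds the threshold.
    while len(variables) > 1:
        vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
               for ix in range(X.iloc[:, variables].shape[1])]
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X.iloc[:, variables].columns[maxloc] +
                  '\' at index: ' + str(maxloc))
            del variables[maxloc]
        else:
            break
    print('Remaining variables:')
    print(X.columns[variables])
    return X.iloc[:, variables]

X = training_set
X2 = calculate_vif_(X)
```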
Upvotes: 0
Reputation: 3010
The error is caused by this snippet: `X[variables].values`. Convert `variables`, which is a `range`, to a `list`.
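To illustrate why the type matters, here is a minimal sketch in plain Python (no pandas involved). A `range` is immutable, so the later `del variables[maxloc]` cannot work on it, whereas a `list` built from the same range can shrink. The variable names simply mirror the question, and `range_deletion_ok` is only there for illustration.

```python
# A range is an immutable sequence: items cannot be deleted from it.
variables = range(4)
try:
    del variables[1]
    range_deletion_ok = True
except TypeError:
    range_deletion_ok = False

# A list built from the same range is mutable, so deletion works.
variables = list(range(4))
del variables[1]

print(range_deletion_ok)  # False
print(variables)          # [0, 2, 3]
```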
As an aside, the code is very confusing. Why are you calling `np.arange` when `variables` is already a `range`? Why are you indexing the DataFrame with a range over the number of columns?
It looks like, from the comments above, you think you are indexing columns by column number, but `X[variables]` actually looks those integers up as column labels, which is why the error mentions `[columns]`. Some of this confusion would be cleared up if you used `loc` or `iloc` to be explicit about what you are trying to index.
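As a rough sketch of the difference, assuming a small stand-in DataFrame with the same column names as in the question (the values and the `positions` name are only for illustration): plain brackets treat integers as column labels, `iloc` selects by integer position, and `loc` selects by label.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the DataFrame in the question.
X = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "Age": [10, 11, 12],
    "Weight": [100, 101, 102],
    "Size": [170, 171, 172],
})

positions = [0, 1]

# Plain brackets look the integers up as column labels, which do not exist here.
try:
    X[positions]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# iloc selects columns by integer position, loc selects them by label.
by_position = X.iloc[:, positions]
by_label = X.loc[:, ["Year", "Age"]]

print(label_lookup_failed)           # True
print(list(by_position.columns))     # ['Year', 'Age']
print(by_position.equals(by_label))  # True
```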
Upvotes: 1