user5739619

Reputation: 1838

Python IndexError: index 1 is out of bounds for LDA

I have a dataset that looks like this:

    Out  Revolver   Ratio     Num ...
0   1    0.766127   0.802982  0   ...
1   0    0.957151   0.121876  1 
2   0    0.658180   0.085113  0 
3   0    0.233810   0.036050  3 
4   1    0.907239   0.024926  5 
...

Out can take only the values 0 and 1. I then tried to generate PCA and LDA plots using the code below, which is similar to this example: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html

features = Train.columns[1:]
Xf = newTrain[features]
yf = newTrain.Out
pca = PCA(n_components=2)
X_r = pca.fit(Xf).transform(Xf)
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(Xf, yf).transform(Xf)

plt.figure()
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(X_r[yf == i, 0], X_r[yf == i, 1], c=c, label=name)
plt.legend()
plt.title('PCA plt')

plt.figure()
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(X_r2[yf == i, 0], X_r2[yf == i, 1], c=c, label=name)
plt.legend()
plt.title('LDA plt')

I can get the PCA plot to work. However, it doesn't make sense because it shows only 2 dots: one at around (-4000, 30) and the other at (2400, 23.7). I don't see a cloud of data points like in the plot at that link.

The LDA plot doesn't work and gives the error

IndexError: index 1 is out of bounds for axis 1 with size 1

I also tried the code below to generate an LDA plot but got the same error

for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(x=X_LDA_sklearn[:, 0][yf==i], y=X_LDA_sklearn[:, 1][yf==i], c=c, label=name)
plt.legend()

Anyone know what's wrong with this?

EDIT: Here are my imports

import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.lda import LDA

As for where the errors are occurring:

I get

FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
plt.scatter(X_r[yf == i,0], X_r[yf == i, 1], c=c, label=name)

at the line inside the for loop for the PCA plot

As for the LDA at the line

plt.scatter(X_r2[yf == i, 0], X_r2[yf == i, 1], c=c, label=name)

I get

FutureWarning: in the future, boolean array-likes will be handled as a boolean array index

and

IndexError: index 1 is out of bounds for axis 1 with size 1

Upvotes: 4

Views: 1978

Answers (1)

Cleb

Reputation: 25997

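You see this error because X_r2 has only one column (at least for the data you provided). LDA can produce at most min(n_classes - 1, n_features) components, so with the two classes in Out there is a single discriminant axis even though you asked for n_components=2. Indexing the second column, as in X_r2[yf == i, 1] or y=X_LDA_sklearn[:, 1][yf==i], therefore raises the IndexError you observe.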
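
A quick way to confirm this is to fit LDA on a two-class toy set and inspect the shape of the result. The sketch below uses made-up random data (not the dataset from the question) and leaves n_components unset so that scikit-learn picks the maximum it can. As far as I know, recent scikit-learn releases reject n_components=2 for two classes with a ValueError instead of quietly returning one column, which is another reason not to hard-code it here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Made-up stand-in for the real data: 20 rows, 3 features, labels 0/1
X_demo = rng.normal(size=(20, 3))
y_demo = np.array([0, 1] * 10)

# Upper bound on the number of LDA components
n_classes = len(np.unique(y_demo))
max_components = min(n_classes - 1, X_demo.shape[1])

# Leaving n_components at its default (None) lets scikit-learn use that maximum
lda_demo = LinearDiscriminantAnalysis()
X_lda = lda_demo.fit(X_demo, y_demo).transform(X_demo)

print(max_components)  # 1
print(X_lda.shape)     # (20, 1)

# Asking for a second column reproduces the error from the question
try:
    X_lda[:, 1]
except IndexError as exc:
    print("IndexError:", exc)
```

With a single component available, only X_lda[:, 0] can be plotted, for example as a histogram per class.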

I added a third class to the example data you provided (with only two classes, reducing to two dimensions is not really meaningful) and also converted your dataframes to arrays. It now runs cleanly and produces the following plots (not very informative because there is so little data):

[Figures: PCA plot and LDA plot of the example data]

Here is the updated code:

import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

trainDF = pd.DataFrame({'Out': [1, 0, 0, 0, 1, 3, 3],
                        'Revolver': [0.766, 0.957, 0.658, 0.233, 0.907, 0.1, 0.15],
                        'Ratio': [0.803, 0.121, 0.085, 0.036, 0.024, 0.6, 0.8],
                        'Num': [0, 1, 0, 3, 5, 4, 4]})
# drop NA values
trainDF = trainDF.dropna()

# replace the values 8 and 17 in 'Num' with the median
# (a single .loc call avoids chained assignment)
trainDF.loc[(trainDF['Num'] == 8) | (trainDF['Num'] == 17), 'Num'] = trainDF['Num'].median()

# class labels as a NumPy array
y = trainDF['Out'].to_numpy()

# feature matrix as a NumPy array
X = trainDF.drop(columns='Out').to_numpy()

# one label per class in y (0, 1, and the added class 3);
# 'added' is just a placeholder name for the third class
target_names = ['out', 'in', 'added']

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)

# three classes, so LDA can return two components
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)

# Percentage of variance explained by each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))

plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 3], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('PCA of Out')

plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 3], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('LDA of Out')

plt.show()
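
The conversion to NumPy arrays also matters for the masks used in the plotting loops. The FutureWarning in the question suggests that the installed NumPy did not treat the pandas boolean Series yf == i as a mask. My assumption (not confirmed anywhere in the post) is that it was read as integer indices 0 and 1, which would explain why the PCA plot showed only two distinct points. A minimal sketch of building an explicit boolean ndarray mask, with made-up values and hypothetical names:

```python
import numpy as np

# Made-up 2-D projection (5 samples) and class labels
X_proj = np.array([[0.1, 1.0],
                   [0.2, 2.0],
                   [0.3, 3.0],
                   [0.4, 4.0],
                   [0.5, 5.0]])
labels = [1, 0, 0, 0, 1]  # could equally be a pandas Series

# np.asarray gives a real ndarray, so the comparison yields a boolean ndarray mask
mask = np.asarray(labels) == 0

selected = X_proj[mask]   # keeps rows 1, 2 and 3
print(mask.dtype)         # bool
print(selected.shape)     # (3, 2)
print(selected[:, 0])     # first coordinate of the selected rows
```

Indexing rows first (X_proj[mask]) and columns second keeps the mask and the column index from interacting.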

So, whenever you run into these "index out of bounds" errors, check the dimensions of your arrays first (for example, print X_r2.shape).
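
One way to build that check into the plotting code is to look at shape[1] before touching the second column and to fall back to a one-dimensional layout when it is missing. The helper below is a hypothetical sketch (scatter_coords is not from the post or from any library): it returns x and y coordinates suitable for plt.scatter and puts all points on a horizontal line when only one component exists.

```python
import numpy as np

def scatter_coords(X_proj):
    """Return (x, y) coordinates for a scatter plot of a projection.

    Hypothetical helper: uses the second column if it exists,
    otherwise places every point at y = 0.
    """
    X_proj = np.asarray(X_proj)
    if X_proj.ndim != 2 or X_proj.shape[1] == 0:
        raise ValueError("expected a 2-D array with at least one column")
    x = X_proj[:, 0]
    y = X_proj[:, 1] if X_proj.shape[1] > 1 else np.zeros_like(x)
    return x, y

# One-column input, like the two-class LDA output in the question
one_col = np.array([[0.5], [-1.2], [2.0]])
x1, y1 = scatter_coords(one_col)
print(x1.shape, y1.shape)  # (3,) (3,)

# Two-column input is passed through unchanged
two_col = np.array([[0.5, 1.0], [-1.2, 2.0]])
x2, y2 = scatter_coords(two_col)
print(y2)
```

The result can be passed straight to plt.scatter(x1, y1), so the same loop works whether the projection has one column or two.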

Upvotes: 2
