Reputation: 1838
I have a dataset that looks like this:
Out Revolver Ratio Num ...
0 1 0.766127 0.802982 0 ...
1 0 0.957151 0.121876 1
2 0 0.658180 0.085113 0
3 0 0.233810 0.036050 3
4 1 0.907239 0.024926 5
...
Out can only take the values 0 and 1.
I then tried to generate PCA and LDA plots using the code below, which is similar to the example here: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html
features = Train.columns[1:]
Xf = newTrain[features]
yf = newTrain.Out
pca = PCA(n_components=2)
X_r = pca.fit(Xf).transform(Xf)
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(Xf, yf).transform(Xf)
plt.figure()
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(X_r[yf == i, 0], X_r[yf == i, 1], c=c, label=name)
plt.legend()
plt.title('PCA plt')
plt.figure()
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(X_r2[yf == i, 0], X_r2[yf == i, 1], c=c, label=name)
plt.legend()
plt.title('LDA plt')
I can get the PCA plot to work, but it doesn't make sense: it shows only 2 dots, one at around (-4000, 30) and the other at (2400, 23.7). I don't see a cloud of data points like in the plot at that link.
The LDA plot doesn't work and gives the error
IndexError: index 1 is out of bounds for axis 1 with size 1
I also tried the code below to generate an LDA plot but got the same error
for c, i, name in zip("rgb", [0, 1], names):
    plt.scatter(x=X_LDA_sklearn[:, 0][yf==i], y=X_LDA_sklearn[:, 1][yf==i], c=c, label=name)
plt.legend()
Anyone know what's wrong with this?
EDIT: Here are my imports
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.lda import LDA
As for where the errors are occurring:
I get
FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
plt.scatter(X_r[yf == i,0], X_r[yf == i, 1], c=c, label=name)
at the line inside the for loop for the PCA plot
As for the LDA at the line
plt.scatter(X_r2[yf == i, 0], X_r2[yf == i, 1], c=c, label=name)
I get
FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
and
IndexError: index 1 is out of bounds for axis 1 with size 1
Upvotes: 4
Views: 1978
Reputation: 25997
The reason you see this error is that X_r2 consists of only one column (at least given the data you provide): LDA can produce at most n_classes - 1 components, and Out has only two classes. In the command y=X_LDA_sklearn[:, 1][yf==i], however, you try to access a second column that does not exist, which throws the error you observe.
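A quick shape check makes this visible. The sketch below uses made-up random data with two classes (not your dataset; all names ending in `_demo` are just for illustration) and leaves `n_components` at its default, since, as far as I know, recent scikit-learn versions raise a ValueError when asked for more components than `n_classes - 1` instead of silently reducing the number:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data: 20 rows, 3 features, binary labels (assumed, not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = np.array([0, 1] * 10)
X_demo[y_demo == 1] += 1.0  # shift class 1 so the two classes differ

# Default n_components resolves to min(n_classes - 1, n_features), i.e. 1 here
lda_demo = LinearDiscriminantAnalysis()
X_lda_demo = lda_demo.fit(X_demo, y_demo).transform(X_demo)

# Only one column comes back, so X_lda_demo[:, 1] would raise an IndexError
print(X_lda_demo.shape)
```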
I added a third class to the example data you provided (with only two classes, LDA cannot fill a second plot axis) and also converted your dataframes to arrays. It now runs through nicely and produces a PCA plot and an LDA plot (not that informative due to the small amount of data).
Here is the updated code:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
trainDF = pd.DataFrame({'Out': [1, 0, 0, 0, 1, 3, 3],
'Revolver': [0.766, 0.957, 0.658, 0.233, 0.907, 0.1, 0.15],
'Ratio': [0.803, 0.121, 0.085, 0.036, 0.024, 0.6, 0.8],
'Num': [0, 1, 0, 3, 5, 4, 4]})
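# drop NA values
trainDF = trainDF.dropna()
# replace the values 8 and 17 in 'Num' with the median (one .loc call avoids chained assignment)
trainDF.loc[trainDF['Num'].isin([8, 17]), 'Num'] = trainDF['Num'].median()
# convert the label column to a numpy array (as_matrix() no longer exists in current pandas)
y = trainDF['Out'].to_numpy()
# convert the feature columns to a numpy array
X = trainDF.drop(columns='Out').to_numpy()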
# the third name is only a placeholder label for the added class 3
target_names = ['out', 'in', 'class 3']
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
# Percentage of variance explained by each component
print('explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
plt.figure()
# loop over all three class labels so the added class is plotted too
for c, i, target_name in zip("rgb", [0, 1, 3], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('PCA of Out')
plt.figure()
for c, i, target_name in zip("rgb", [0, 1, 3], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
plt.legend()
plt.title('LDA of Out')
plt.show()
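A note on the conversion to arrays: the FutureWarning in the question suggests that the boolean pandas Series `yf == i` was not being treated as a boolean mask by the NumPy version in use. My guess is that this may also be why the PCA plot showed only two points, but I have not verified that. Building the mask from a plain NumPy array avoids the ambiguity. A minimal sketch, where `X_proj` and `labels` are hypothetical stand-ins for the projected data and the Out column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the projected data and the label column
X_proj = np.arange(10, dtype=float).reshape(5, 2)
labels = pd.Series([1, 0, 0, 0, 1], name="Out")

y_arr = labels.to_numpy()      # plain NumPy array of labels
mask = y_arr == 0              # boolean NumPy mask, one entry per row
selected = X_proj[mask, 0]     # first component of every row with Out == 0

print(mask.dtype, selected)
```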
So, when you run into these "Index out of Bounds" errors, always check the dimensions of your arrays first.
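If you want the plotting code to survive the two-class case as well, one option is to check the number of columns before indexing. This is only a sketch: `plot_coords` is a hypothetical helper, not part of scikit-learn or matplotlib, and placing a single component on a horizontal line is just one possible way to display it.

```python
import numpy as np

def plot_coords(X_proj):
    """Return (x, y) plotting coordinates for a projection with one or more columns."""
    X_proj = np.asarray(X_proj)
    if X_proj.ndim != 2:
        raise ValueError("expected a 2-D array, got shape %s" % (X_proj.shape,))
    x = X_proj[:, 0]
    if X_proj.shape[1] >= 2:
        y = X_proj[:, 1]
    else:
        # single component: put all points on a horizontal line at 0
        y = np.zeros_like(x)
    return x, y

# One-column projection (what LDA returns for two classes)
x1, y1 = plot_coords(np.array([[0.5], [-1.2], [2.0]]))
# Two-column projection (what PCA with n_components=2 returns)
x2, y2 = plot_coords(np.array([[1.0, 2.0], [3.0, 4.0]]))
```

With the arrays from the code above, you would then call something like `xs, ys = plot_coords(X_r2)` and pass `xs[y == i], ys[y == i]` to `plt.scatter`.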
Upvotes: 2