Reputation: 5551
I am attempting to classify some data with the scikit learn LDA classifier. I'm not entirely sure what to "expect" from it, but what I am getting is weird. Seems like a good opportunity to learn about either a shortcoming of the technique, or a way in which I am applying it wrong. I understand that no line could completely separate this data, but it seems that there are much "better" lines than the one it is finding. I'm just using the default options. Any thoughts on how to do this better? I'm using LDA because it is linear in the size of my dataset. Although I think a linear SVM has a similar complexity. Perhaps it would be better for such data? I will update when I have tested other possibilities.
The picture: (light blue is what my LDA classifier predicts will be dark blue)
The code:
import numpy as np
from numpy import array
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import itertools
X = array([[ 0.23125754, 0.79170351],
[ 0.78021491, -0.24999486],
[ 0.00856446, 0.41452734],
[ 0.66381753, -0.09872504],
[-0.03178685, 0.04876317],
[ 0.65574645, -0.68214948],
[ 0.14290684, 0.38256002],
[ 0.05156987, 0.11094875],
[ 0.06843403, 0.19110019],
[ 0.24070898, -0.07403764],
[ 0.03184353, 0.4411446 ],
[ 0.58708124, -0.38838008],
[-0.00700369, 0.07540799],
[-0.01907816, 0.07641038],
[ 0.30778608, 0.30317186],
[ 0.55774143, -0.38017325],
[-0.00957214, -0.03303287],
[ 0.8410637 , 0.158594 ],
[-0.00294113, -0.00380608],
[ 0.26577841, 0.07833684],
[-0.32249375, 0.49290502],
[ 0.11313078, 0.35697211],
[ 0.41153679, -0.4471876 ],
[-0.00313315, 0.30065913],
[ 0.14344143, -0.19127107],
[ 0.04857767, 0.01339191],
[ 0.5865007 , 0.71209886],
[ 0.08157439, 0.40909955],
[ 0.72495202, 0.29583866],
[-0.09391461, 0.17976605],
[ 0.06149141, 0.79323099],
[ 0.52208024, -0.2877661 ],
[ 0.01992141, -0.00435266],
[ 0.68492617, -0.46981335],
[-0.00641231, 0.29699622],
[ 0.2369677 , 0.140319 ],
[ 0.6602586 , 0.11200433],
[ 0.25311836, -0.03085372],
[-0.0895014 , 0.45147252],
[-0.18485667, 0.43744524],
[ 0.94636701, 0.16534406],
[ 0.01887734, -0.07702135],
[ 0.91586801, 0.17693792],
[-0.18834833, 0.31944796],
[ 0.20468328, 0.07099982],
[-0.15506378, 0.94527383],
[-0.14560083, 0.72027034],
[-0.31037647, 0.81962815],
[ 0.01719756, -0.01802322],
[-0.08495304, 0.28148978],
[ 0.01487427, 0.07632112],
[ 0.65414479, 0.17391618],
[ 0.00626276, 0.01200355],
[ 0.43328095, -0.34016614],
[ 0.05728525, -0.05233956],
[ 0.61218382, 0.20922571],
[-0.69803697, 2.16018536],
[ 1.38616732, -1.86041621],
[-1.21724616, 2.72682759],
[-1.26584365, 1.80585403],
[ 1.67900048, -2.36561699],
[ 1.35537903, -1.60023078],
[-0.77289615, 2.67040114],
[ 1.62928969, -1.20851808],
[-0.95174264, 2.51515935],
[-1.61953649, 2.34420531],
[ 1.38580104, -1.9908369 ],
[ 1.53224512, -1.96537012]])
y = array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1.])
classifier = LDA()
classifier.fit(X,y)
xx = np.array(list(itertools.product(np.linspace(-4,4,300), np.linspace(-4,4,300))))
yy = classifier.predict(xx)
b_colors = ['salmon' if yyy==0 else 'deepskyblue' for yyy in yy]
p_colors = ['r' if yyy==0 else 'b' for yyy in y]
plt.scatter(xx[:,0],xx[:,1],s=1,marker='o',edgecolor=b_colors,c=b_colors)
plt.scatter(X[:,0], X[:,1], marker='o', s=5, c=p_colors, edgecolor=p_colors)
plt.show()
UPDATE: Changing from using sklearn.discriminant_analysis.LinearDiscriminantAnalysis
to sklearn.svm.LinearSVC
also using the default options gives the following picture:
I think using the zero-one loss instead of the hinge loss would help, but sklearn.svm.LinearSVC
doesn't seem to allow custom loss functions.
UPDATE: The loss function to sklearn.svm.LinearSVC
approaches the zero-one loss as the parameter C
goes to infinity. Setting C = 1000
gives me what I was originally hoping for. Not posting this as an answer, because the original question was about LDA.
picture:
Upvotes: 3
Views: 367
Reputation: 9645
LDA models each class as a Gaussian, so the model for each class is determined by the class' estimated mean vector and covariance matrix. Judging by the eye only, your blue and red classes have approximately the same mean and same covariance, which means the 2 Gaussians will 'sit' on top of each other, and the discrimination will be poor. Actually it also means that the separator (the blue-pink border) will be noisy, that is it will change a lot between random samples of your data.
Btw your data is clearly not linearly-separable, so every linear model will have a hard time discriminating the data.
If you must use a linear model, try using LDA with 3 components, such that the top-left blue blob is classified as '0', the bottom-right blue blob as '1', and the red as '2'. This way you will get a much better linear model. You can do it by preprocessing the blue class with a clustering algorithm with K=2 classes.
Upvotes: 1