Reputation: 4630
I would like to mock up two different approaches of a classification algorithm, like in this documentation example. This is what I tried:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2, 2))
import pandas as pd
df = pd.read_csv('/data.csv',
                 header=0, sep=',', names=['SentenceId', 'Sentence', 'Sentiment'])
X = tfidf_vect.fit_transform(df['Sentence'].values)
y = df['Sentiment'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.33)
from sklearn.svm import SVC
#first svm
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-10, 10)
yy = a * xx - clf.intercept_[0] / w[1]
# get the separating hyperplane using weighted classes
#second svm
wclf = SVC(kernel='linear', class_weight={5: 10}, C=1000)
wclf.fit(X_train, y_train)
weighted_prediction = wclf.predict(X_test)
#PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X.toarray())
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]
# plot separating hyperplanes and samples
import matplotlib.pyplot as plt
h0 = plt.plot(xx, yy, 'k-', label='no weights')
h1 = plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()
plt.axis('tight')
plt.show()
But I get the following exception:
Traceback (most recent call last):
  File "file.py", line 25, in <module>
    a = -w[0] / w[1]
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/csr.py", line 253, in __getitem__
    return self._get_row_slice(row, col)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/csr.py", line 320, in _get_row_slice
    raise IndexError('index (%d) out of range' % i)
IndexError: index (1) out of range
How can I plot this correctly in 2-D or 3-D with matplotlib? I also tried this, but clearly this is wrong:
Thanks in advance, this is the data I am using to do this.
When I print w, this is what happens:
(0, 911) -0.352103548716
(0, 2346) -1.20396753467
(0, 2482) -0.352103548716
(0, 2288) -0.733605938797
(0, 1175) -0.868966214318
(0, 1936) -0.500071158622
(0, 2558) -0.40965370142
(0, 788) -0.485330735934
(0, 322) -0.575610464517
(0, 453) -0.584854414882
(0, 1913) -0.300076915818
(0, 2411) -0.419065159403
(0, 2017) -0.407926583824
(0, 2363) -0.407926583824
(0, 815) -1.09245625795
(0, 543) -0.248207856236
(0, 1082) -0.366433457602
(0, 1312) -0.286768829333
(0, 1525) -0.286768829333
(0, 1677) -0.286768829333
(0, 2679) -0.688619491265
(0, 413) -0.101096807406
(0, 1322) -0.13561265293
(0, 1488) -0.120403497624
(0, 1901) -0.337806267742
: :
(0, 1609) 0.100116485705
(0, 581) 0.276579777388
(0, 2205) 0.241642287418
(0, 1055) 0.0166785719624
(0, 2390) 0.349485515339
(0, 1866) 0.357035248059
(0, 2098) 0.296454010725
(0, 2391) 0.45905660273
(0, 2601) 0.357035248059
(0, 619) 0.350880030278
(0, 129) 0.287439419266
(0, 280) 0.432180530894
(0, 1747) -0.172314049543
(0, 1211) 0.573579514463
(0, 86) 0.3152907757
(0, 452) 0.305881204557
(0, 513) 0.212678772368
(0, 946) -0.347372778859
(0, 1194) 0.298193025133
(0, 2039) 0.34451957335
(0, 2483) 0.245366213834
(0, 317) 0.355996551812
(0, 977) 0.355996551812
(0, 1151) 0.284383826645
(0, 2110) 0.120512273328
So w is a very large sparse matrix.
Upvotes: 1
Views: 301
Reputation: 222
If w is a sparse matrix, you have to access it as such. Try:
a = -w[0, 0] / w[0, 1]
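Or, as a minimal sketch (assuming clf is your already-fitted linear SVC), convert the coefficient row to a dense 1-D array first and keep the plain indexing:
# coef_ comes back as a 1 x n_features scipy.sparse matrix when the
# classifier was fitted on sparse input, so densify before 1-D indexing
w = clf.coef_[0].toarray().ravel()
a = -w[0] / w[1]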
Although I have to warn you: the example you are following for visualization is of a very simple 2D problem. For the visualization you have in mind to make any sense at all, you would have to perform dimensionality reduction (such as PCA) before visualizing the problem. While you can obviously plot the first 2 coordinates of your 12k dimensions, the chances of these happening to be the most informative dimensions is virtually 0.
EDIT: looking at your w matrix, this will still not work, but at least now it should give a division-by-zero problem rather than an index-out-of-range one. Thinking about it a bit more, I am not quite sure how to solve your problem. If your aim is to visualize your data, you can use PCA to reduce your data to 2D first, and then run an SVM to find a separator (with and without weights), but it is unlikely that your SVM parameters will generalize to your actual problem. On the other hand, you can run your SVM in the higher dimension and use it to colour your solution in your PCA. In the best case this will give you two fairly well separated coloured groups: in that case your SVM is working really well, and PCA maintains most of the structure of your problem. However, if one of these two conditions does not hold (and particularly the latter is unlikely to hold for most problems), you will get an almost random pattern in your colours. In this case you cannot really draw any conclusions at all.
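The second option would look roughly like this (a sketch only; the variable names are illustrative, and X and y are your TF-IDF matrix and labels from above):
from sklearn.svm import SVC
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

clf = SVC(kernel='linear')
clf.fit(X, y)                    # SVM in the original high-dimensional space
labels = clf.predict(X)          # predictions are used only for colouring
pca = PCA(n_components=2)
reduced = pca.fit_transform(X.toarray())  # PCA needs a dense array
plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap=plt.cm.Paired)
plt.show()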
EDIT 2: I have written a short script for extracting 2 PCA dimensions and plotting them for you. Note that I don't reduce to 2; I reduce to 10000 and then extract the first 2. In practice it doesn't make a lot of difference (my code is less efficient), but this way it allows me to illustrate a point: if you reduce to 10,000 dimensions you lose no representative power, meaning you have about 2k useless dimensions (or more; I didn't try to reduce the PCA further). However, reducing to 2 goes far too far: you are then left with an explained variance of 0.07, which is far too low to do anything useful with, as you see in the plot. Note that if you zoom in something fierce on the plot, there seems to be a linear correlation between the first 2 components of your PCA reduction. Unfortunately I am not a good enough statistician to tell you what that means. If I had to take a guess, I would say that you have quite a bit of covariance in your data, but this is a total stab in the dark.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2, 2))
df = pd.read_csv('corpus.txt',
                 header=0, sep=',', names=['SentenceId', 'Sentence', 'Sentiment'])
X = tfidf_vect.fit_transform(df['Sentence'].values)
y = df['Sentiment'].values

# PCA needs a dense array; reduce to 10000 components, then look at the first 2
pca = PCA(n_components=10000)
reduced = pca.fit_transform(X.toarray())
print(sum(pca.explained_variance_ratio_))
print(pca.explained_variance_ratio_[0] + pca.explained_variance_ratio_[1])

# group the reduced samples by class label
by_class = {}
for i in range(len(y)):
    if y[i] not in by_class:
        by_class[y[i]] = []
    by_class[y[i]].append(reduced[i])

# plot the first two principal components, one colour per class
for c in by_class:
    toplt = np.array(by_class[c]).T
    plt.plot(toplt[0], toplt[1], linestyle='', marker='o')
plt.show()
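If you only want those first two coordinates, you can of course reduce straight to 2 components, which is much cheaper (a sketch):
# equivalent for plotting purposes, but avoids computing 10000 components
pca2 = PCA(n_components=2)
reduced2 = pca2.fit_transform(X.toarray())
print(sum(pca2.explained_variance_ratio_))  # about 0.07 on this data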
Upvotes: 2
Reputation: 144
w = clf.coef_[0]
a = -w[0] / w[1]
It would seem your w is a matrix that only contains one row. This would be the reason you are receiving an error when trying to access the second index, w[1].
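A quick way to confirm this (a sketch, assuming clf is the fitted classifier from the question):
# coef_[0] on a sparse coef_ matrix is still a 1 x n_features sparse
# matrix, so w[1] asks for a second row that does not exist
w = clf.coef_[0]
print(type(w), w.shape)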
Upvotes: 2