Reputation: 73
I am trying to run an SVM linear kernel using a generated dataset. My dataset has 5000 rows and 4 columns:
CL_scaled.head() [screenshot of data frame]
I split the data into 20% test and 80% training:
train, test = train_test_split(CL_scaled, test_size=0.2)
and get a shape of (4000,4) for train and (1000,4) for test
However, when I run the svm on the training and testing data, I get the following error:
svclassifier = SVC(kernel='linear', C = 5)
svclassifier.fit(train, test)
ValueError Traceback (most recent call last)
<ipython-input-81-4c4a7bdcbe85> in <module>
----> 1 svclassifier.fit(train, test)
~/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py in fit(self, X, y, sample_weight)
144 X, y = check_X_y(X, y, dtype=np.float64,
145 order='C', accept_sparse='csr',
--> 146 accept_large_sparse=False)
147 y = self._validate_targets(y)
148
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
722 dtype=None)
723 else:
--> 724 y = column_or_1d(y, warn=True)
725 _assert_all_finite(y)
726 if y_numeric and y.dtype.kind == 'O':
~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
758 return np.ravel(y)
759
--> 760 raise ValueError("bad input shape {0}".format(shape))
761
762
ValueError: bad input shape (1000, 4)
Can someone please let me know what is wrong with my code or data? Thanks in advance!
train.head()
0 1 2 3
2004 1.619999 1.049560 1.470708 -1.323666
1583 1.389370 -0.788002 -0.320337 -0.898712
1898 -1.436903 0.994719 0.326256 0.495565
892 1.419123 1.522091 1.378514 -1.731400
4619 0.063095 1.527875 -1.285816 -0.823347
test.head()
0 1 2 3
1118 -1.152435 -0.484851 -0.996602 1.617749
4347 -0.519430 -0.479388 1.483582 -0.413985
2220 -0.966766 -1.459475 -0.827581 0.849729
204 1.759567 -0.113363 -1.618555 -1.383653
3578 0.329069 1.151323 -0.652328 1.666561
print(test.shape)
print(train.shape)
(1000, 4)
(4000, 4)
Upvotes: 0
Views: 3736
Reputation: 6270
You are missing the basic concept of supervised machine learning.
In a classification problem you have features X, and you use them to predict a class y. For example, the data can look like this:
  X (features)     y (target)
Height  Weight     class
170     50         1
180     60         1
10      10         0
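As a small illustration of the table above, the features and the target can be pulled out of a pandas DataFrame by column name. This is only a sketch: the frame `toy` and its column names are taken from the toy table, not from the asker's data.

```python
import pandas as pd

# Toy data copied from the table above (illustrative only).
toy = pd.DataFrame(
    {"Height": [170, 180, 10], "Weight": [50, 60, 10], "class": [1, 1, 0]}
)

X = toy[["Height", "Weight"]]  # 2-D feature matrix, one row per sample
y = toy["class"]               # 1-D target vector, one label per sample

print(X.shape, y.shape)
```

The point to notice is that X is two-dimensional while y is one-dimensional, which is the layout scikit-learn estimators expect in fit.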
The idea is that an algorithm has a training part (you go to soccer practice to train) and a test part (you test your skills on the field at the weekend).
Therefore you need to split your data into a training set and a test set.
from sklearn.model_selection import train_test_split
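X_train, X_test, y_train, y_test = train_test_split(CL_scaled.iloc[:, :-1], CL_scaled.iloc[:, -1], test_size=0.2)
CL_scaled.iloc[:, :-1] is your X (every column except the last), and CL_scaled.iloc[:, -1] is your y (the last column, assuming that is where the class labels are).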
Then you are using this to fit your classifier (training part):
svclassifier = SVC(kernel='linear', C = 5)
svclassifier.fit(X_train, y_train)
And then you can test it:
prediction = svclassifier.predict(X_test)
This returns the predicted classes for the test set, which you can then compare against y_test.
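Putting the steps of this answer together, a minimal end-to-end sketch might look like the following. The original CL_scaled is not available, so the frame `data` is a synthetic stand-in with three feature columns and a made-up binary label in the last column. The `random_state` values are only there for reproducibility. For a pandas DataFrame, selecting columns by position goes through `.iloc`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for CL_scaled (assumption: last column is the label).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)
data = pd.DataFrame(features)
data[3] = labels

X = data.iloc[:, :-1]  # every column except the last
y = data.iloc[:, -1]   # last column only, 1-D

# Split features and labels together so the rows stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = SVC(kernel="linear", C=5)
clf.fit(X_train, y_train)      # train on features and labels

y_pred = clf.predict(X_test)   # predict takes only the features
accuracy = accuracy_score(y_test, y_pred)
print(X_train.shape, y_train.shape, accuracy)
```

Note that SVC is a classifier, so the label column needs to hold discrete classes. If every column in the real data is continuous, a label column has to be added first, or a regressor such as SVR may be the better fit.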
Upvotes: 0
Reputation: 13426
The error is caused by train, test = train_test_split(CL_scaled, test_size=0.2)
First, you need to separate the features from the output variable and pass both to train_test_split.
# I am assuming your last column is the output variable
train_test_split(CL_scaled.iloc[:, :-1], CL_scaled.iloc[:, -1], test_size=0.2)
train_test_split then splits your data into 4 parts: X_train, X_test, y_train, y_test
Furthermore, svclassifier.fit takes the independent variables and the output variable as parameters, so you need to pass X_train and y_train.
So your code should be
X_train, X_test, y_train, y_test = train_test_split(CL_scaled.iloc[:, :-1], CL_scaled.iloc[:, -1], test_size=0.2)
svclassifier = SVC(kernel='linear', C = 5)
svclassifier.fit(X_train, y_train)
For more details, refer to the documentation.
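To see why the original call fails, here is a small sketch that reproduces the same kind of error with random arrays. The arrays are made up for illustration. The exact wording of the message depends on the scikit-learn version, so the code only checks that a ValueError is raised.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))     # plays the role of the training frame
y_2d = rng.normal(size=(10, 4))  # plays the role of the test frame passed as y

try:
    # The second argument of fit must be a 1-D array of labels, not a 2-D table.
    SVC(kernel="linear", C=5).fit(X, y_2d)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

In other words, the second argument of fit is interpreted as the labels for the rows of the first argument, so passing the test set there cannot work.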
Upvotes: 1