Sudhendu

Reputation: 380

ValueError in scikit-learn: number of features of the model doesn't match the input

I am pretty new to machine learning in general and to scikit-learn in particular.

I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html

To practice on my own, I am using my own data set. It is divided into two CSV files:

Train_data.csv (contains 32 columns; the last column is the output value).

Test_data.csv (contains 31 columns; the output column is missing, which should be the case, no?)

So the test data has one column fewer than the training data.

I am using the following code to learn (using training data) and then predict (using test data).

The issue I am facing is the error:

*ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time*

Here is my code (sorry if it looks completely wrong :( )

import pandas as pd #import the library
from sklearn import svm 

mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"]  #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data


clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target)  #Code from the URL above 

test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column 

clf.predict(test_data[-1:]) #Code from the URL above

The header row of the training data CSV looks something like this:

Value1,Value2,Value3,Value4,Output

The header row of the test data CSV looks something like this:

Value1,Value2,Value3,Value4.

Thanks :)

Upvotes: 1

Views: 4212

Answers (1)

dooms

Reputation: 1645

Yours is a supervised learning problem: you have data in the form of (input, output) pairs.

The inputs are the features describing each example, and the output is the prediction your model should return for that input.

Your training CSV therefore has one more column than your test CSV, because to train the model you need to give it the known output.

The general workflow in sklearn for a supervised problem looks like this:

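X, Y = read_data(data)  # placeholder: load your features X and labels Y
n = len(X)
split = int(n * 0.8)  # slice indices must be integers; keep 80% for training
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]

model.fit(X_train, Y_train)  # model is any sklearn estimator, e.g. svm.SVC()
model.score(X_test, Y_test)  # for classifiers: mean accuracy on the held-out 20%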

To split your data you can use train_test_split (from sklearn.model_selection), and sklearn.metrics provides several metrics for judging your model's performance.
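As a rough sketch of that split-and-score step, the snippet below uses train_test_split and accuracy_score on synthetic data. The random features and labels are only stand-ins (an assumption, since your CSV isn't available here), and the SVC parameters are simply the ones from your question.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 100 examples, 4 features, binary label.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same hyperparameters as in the question.
model = SVC(gamma=0.001, C=100.0)
model.fit(X_train, y_train)

# Accuracy on the held-out rows; model.score would give the same number.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(X_train.shape, X_test.shape, accuracy)
```

Note that both splits keep the same number of columns; only the number of rows differs.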

You should check the shape of your data

data.shape
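For instance, a quick way to see the mismatch is to compare the column counts of the frame you fit on and the frame you predict on. The miniature frames below are hypothetical stand-ins for your two CSV files (5 and 4 columns instead of 32 and 31), and .iloc is used as the positional equivalent of the .ix slice from your question.

```python
import pandas as pd

# Hypothetical miniature versions of the training and test CSVs.
train = pd.DataFrame({
    "Value1": [1, 2, 3],
    "Value2": [4, 5, 6],
    "Value3": [7, 8, 9],
    "Value4": [1, 0, 1],
    "Desired": [0, 1, 0],
})
test = pd.DataFrame({
    "Value1": [2, 3],
    "Value2": [5, 6],
    "Value3": [8, 9],
    "Value4": [0, 1],
})

# The slice from the question: drops the last THREE columns, not one.
data = train.iloc[:, :-3]

n_train_features = data.shape[1]
n_test_features = test.shape[1]

# Columns present in the test data but missing from the training features.
dropped = [c for c in test.columns if c not in data.columns]
print(n_train_features, n_test_features, dropped)
```

If the two counts differ, predict will raise the same kind of ValueError you are seeing.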

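It looks like you are dropping the last 3 columns instead of only the last one: 32 columns minus 3 leaves the 29 features named in the error message, while your test data has 31. Try instead:

data = mydata.iloc[:, :-1]  # same slice as mydata.ix[:, :-1]; .ix was removed in pandas 1.0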
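An alternative that doesn't depend on counting columns is to drop the label column by name and then select the same feature columns from the test data. This is only a sketch on made-up numbers: the tiny frames are hypothetical, and only the column name "Desired" and the SVC parameters come from your question.

```python
import pandas as pd
from sklearn import svm

# Hypothetical stand-ins for the training and test CSV files.
train = pd.DataFrame({
    "Value1": [0.10, 0.90, 0.20, 0.80, 0.15, 0.85],
    "Value2": [1.00, 0.00, 0.90, 0.10, 0.95, 0.05],
    "Desired": [0, 1, 0, 1, 0, 1],
})
test = pd.DataFrame({
    "Value1": [0.12, 0.88],
    "Value2": [0.97, 0.03],
})

# Separate the label from the features by name instead of by position.
target = train["Desired"]
features = train.drop(columns=["Desired"])

clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(features, target)

# Select exactly the training feature columns, in the same order.
test_aligned = test[features.columns]
predictions = clf.predict(test_aligned)
print(features.shape[1], test_aligned.shape[1], len(predictions))
```

Selecting test[features.columns] raises a KeyError if the test file lacks one of the training columns, which surfaces the problem before predict is called.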

Upvotes: 1
