Reputation: 75
I am trying to fit a machine learning model on a dataset with 1059 rows and 4 columns, but I get the following error when fitting the model with:
knn.fit(myData['RAB'], myData['ETAPE'])
ValueError: Found input variables with inconsistent numbers of samples: [1, 1059]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
The output of shape is:
(1059, 4)
Also, how can I define more than one predictor variable?
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
myData=pd.read_csv('sabmin.csv', sep=';')
print(myData.shape)
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(myData['RAB'], myData['ETAPE'])
Upvotes: 0
Views: 210
Reputation: 33512
The shapes you are passing do not match what sklearn expects.
Here:
knn.fit(myData['RAB'], myData['ETAPE'])
it seems you are passing one Series as input and one as output. That is probably not what you want, because sklearn treats the 1-D input as a single sample with 1059 features. The error message ([1, 1059]) is consistent with this.
It's hard to know exactly what you are trying to do, but at a minimum you need to reshape the input from (1, 1059) to (1059, 1). I would also have expected you to want to use more columns as predictors, but I can't tell from the question.
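To make the two shapes concrete, here is a minimal sketch using plain NumPy. The array x is synthetic stand-in data of the same length as your column, not your actual file:

```python
import numpy as np

# Synthetic stand-in for one column of 1059 values (not the real data).
x = np.arange(1059, dtype=float)

# A single column is one-dimensional: 1059 values, no explicit feature axis.
print(x.shape)

# 1059 samples with 1 feature each: the layout sklearn expects for X.
as_column = x.reshape(-1, 1)
print(as_column.shape)

# 1 sample with 1059 features: how the 1-D input was interpreted in the error.
as_row = x.reshape(1, -1)
print(as_row.shape)
```

The -1 simply tells NumPy to infer that dimension from the array's length.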
Alternatively, you could convert to a NumPy array early on to make things easier (myData.as_matrix(), or myData.to_numpy() in newer pandas versions, where as_matrix() has been removed). I mostly use NumPy with sklearn myself, though many people prefer pandas for its name-based indexing.
The former would be something like:
knn.fit(myData['RAB'].values.reshape(-1, 1), myData['ETAPE'])
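As for using more than one predictor, one possible approach is to select several columns with a list, which gives a 2-D DataFrame that sklearn accepts directly. The sketch below is only an illustration under assumptions: the data is randomly generated rather than read from sabmin.csv, and the column name 'X2' is hypothetical (I don't know what your other columns are called).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the CSV; 'X2' is a hypothetical column name.
rng = np.random.default_rng(0)
n = 1059
myData = pd.DataFrame({
    'RAB': rng.normal(size=n),
    'X2': rng.normal(size=n),
    'ETAPE': rng.integers(0, 3, size=n),
})

# Double brackets keep a 2-D DataFrame, even for a single predictor.
X_one = myData[['RAB']]

# Several predictors: list every column you want to use.
X_many = myData[['RAB', 'X2']]
y = myData['ETAPE']

# The same selection as a plain NumPy array, if you prefer working with NumPy.
X_arr = X_many.to_numpy()

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_many, y)

# Predict the class of the first five rows.
preds = knn.predict(X_many.iloc[:5])
print(X_one.shape, X_many.shape, X_arr.shape, preds.shape)
```

With your own file, you would replace the synthetic DataFrame with the result of pd.read_csv and put your real column names in the list.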
I really recommend reading sklearn's docs (some of the best docs around), and probably the pandas and numpy docs as well, to understand exactly what is happening.
You may notice that sklearn's huge collection of examples is mostly based on numpy inputs. This is easier for beginners, since pandas adds one more layer of complexity (DataFrames, Series, ...).
Upvotes: 2