Reputation: 7752
I'm trying to use the sklearn_pandas module to extend the work I do in pandas and dip a toe into machine learning but I'm struggling with an error I don't really understand how to fix.
I was working through the following dataset on Kaggle.
It's essentially an unheadered table (1000 rows, 40 features) with floating point values.
import pandas as pdfrom sklearn import neighbors
from sklearn_pandas import DataFrameMapper, cross_val_score
path_train ="../kaggle/scikitlearn/train.csv"
path_labels ="../kaggle/scikitlearn/trainLabels.csv"
path_test = "../kaggle/scikitlearn/test.csv"
train = pd.read_csv(path_train, header=None)
labels = pd.read_csv(path_labels, header=None)
test = pd.read_csv(path_test, header=None)
mapper_train = DataFrameMapper([(list(train.columns),neighbors.KNeighborsClassifier(n_neighbors=3))])
mapper_train
Output:
DataFrameMapper(features=[([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
n_neighbors=3, p=2, weights='uniform'))])
So far so good. But then I try the fit
mapper_train.fit_transform(train, labels)
Output:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-e3897d6db1b5> in <module>()
----> 1 mapper_train.fit_transform(train, labels)
//anaconda/lib/python2.7/site-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
409 else:
410 # fit method of arity 2 (supervised transformation)
--> 411 return self.fit(X, y, **fit_params).transform(X)
412
413
//anaconda/lib/python2.7/site-packages/sklearn_pandas/__init__.pyc in fit(self, X, y)
116 for columns, transformer in self.features:
117 if transformer is not None:
--> 118 transformer.fit(self._get_col_subset(X, columns))
119 return self
120
TypeError: fit() takes exactly 3 arguments (2 given)`
What am I doing wrong? While the data in this case is all the same, I'm planning to work up a workflow for mixtures categorical, nominal and floating point features and sklearn_pandas seemed to be a logical fit.
Upvotes: 1
Views: 4173
Reputation: 7185
Since sklearn_pandas
doesn't currently support estimators accepting an y
vector with labels, you will have to use it only to transform all features to a Numpy matrix and then use the KNeighborsClassifier
in a separate step.
UPDATE 2015-08-10 - sklearn_pandas
DataFrameMapper
isn't intended to be used as a pipeline for transformation + model fitting, but only for transforming the columns selectively. If you want to transform then estimate a model, use a plain sklearn
Pipeline
with the dataframe mapper as first step.
Upvotes: 1