How to send a dataframe to scikit for cross validation?

Question

I want to plot the effect of removing samples (rows). Some people call it a "learning curve".

So I thought of using Pandas to remove some rows. How to remove, randomly, rows from a dataframe but from each label?

But when I try to do cross validation, I get an error (even after using df.values to turn the dataframe into an array).

So, what am I doing wrong?

Here is my code:

import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import neighbors
from sklearn import cross_validation

df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label

df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))

X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values
print(X)
print(y)

clf = neighbors.KNeighborsClassifier()
sss = StratifiedShuffleSplit(1, test_size=0.1)
scoresSSS = cross_validation.cross_val_score(clf, X, y, cv=sss)
print(scoresSSS)

semore_1267 · Accepted Answer

Right off the bat, with sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35) you're creating a splitter object, not an iterable of train/test indices:

>>> type(sss)
<class 'sklearn.model_selection._split.StratifiedShuffleSplit'>

Instead of giving cross_validation.cross_val_score your entire StratifiedShuffleSplit object (which isn't iterable, thus the error), you need to give it the train/test output of the object's .split() method.
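To make the distinction concrete, here is a minimal sketch. The toy X and y and the random_state value are assumptions for illustration, not your data. It checks that the splitter itself cannot be iterated, while .split(X, y) yields pairs of train and test index arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumed for illustration): 6 samples, 2 per class
X = np.random.rand(6, 5)
y = np.array([1, 1, 2, 2, 3, 3])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35, random_state=0)

# The splitter object defines no __iter__, so iterating it directly fails
try:
    iter(sss)
    splitter_is_iterable = True
except TypeError:
    splitter_is_iterable = False

# .split() returns a generator of (train_indices, test_indices) pairs
splits = list(sss.split(X, y))
train_idx, test_idx = splits[0]
print(splitter_is_iterable, train_idx, test_idx)
```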

Further, the test_size param in your StratifiedShuffleSplit is too small. After sampling, df1 has only 6 rows, so test_size=0.1 rounds up to a single test sample, while a stratified split needs at least one test sample per class. With 3 unique classes, that throws a ValueError.

And lastly, you're using the default n_neighbors value (5) in your KNeighborsClassifier. This default is too large for such a small data set: the classifier requires n_neighbors <= n_samples, and with test_size=0.35 the training fold has only 3 samples, so you get another ValueError.

So in my example below I've upped the test size in your StratifiedShuffleSplit object, dropped n_neighbors down to 2, and passed the iterable from sss.split(X, y) to cross_validation.cross_val_score's cv param.
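If you want to see both ValueErrors in isolation, here is a rough sketch. The toy arrays are assumed stand-ins for your 6 sampled rows, and the exact error messages may differ between scikit-learn versions, so the code only records that a ValueError occurred:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the 6 sampled rows (assumed for illustration)
X = np.random.rand(6, 5)
y = np.array([1, 1, 2, 2, 3, 3])

errors = []

# 1) test_size=0.1 of 6 samples gives 1 test sample, fewer than the 3 classes
try:
    list(StratifiedShuffleSplit(n_splits=1, test_size=0.1).split(X, y))
except ValueError as exc:
    errors.append(("test_size", str(exc)))

# 2) Default n_neighbors=5, but only 3 training samples are available
try:
    clf = KNeighborsClassifier()
    clf.fit(X[:3], y[:3])
    clf.predict(X[3:])
except ValueError as exc:
    errors.append(("n_neighbors", str(exc)))

for name, message in errors:
    print(name, "->", message)
```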

So here is what you want your code to look like:

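import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import neighbors
# Legacy module: deprecated in scikit-learn 0.18 and removed in 0.20
from sklearn import cross_validation

# 12 random samples with 5 features, 4 samples per label
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label

# Keep 2 randomly chosen rows per label
df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))

X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values

# n_neighbors must not exceed the number of training samples
clf = neighbors.KNeighborsClassifier(n_neighbors=2)
# test_size must leave at least one test sample per class
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35)

# Pass the (train, test) index pairs, not the splitter object itself
scoresSSS = cross_validation.cross_val_score(clf, X, y, cv=sss.split(X, y))
print(scoresSSS)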
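One caveat: the sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20, so the import above fails on current releases. A sketch of the same fix using cross_val_score from sklearn.model_selection follows. That version accepts the splitter object directly as cv, so calling .split() yourself is optional there. The fixed seeds are my addition for reproducibility, not part of your code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Seeds are an assumption added for reproducibility
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(12, 5))
df['label'] = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

# Keep 2 randomly chosen rows per label
df1 = pd.concat(g.sample(2, random_state=0) for _, g in df.groupby('label'))

X = df1[[0, 1, 2, 3, 4]].values
y = df1['label'].values

clf = KNeighborsClassifier(n_neighbors=2)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35, random_state=0)

# model_selection.cross_val_score takes the splitter object itself
scores = cross_val_score(clf, X, y, cv=sss)
print(scores)
```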

Let me just say that I have no idea what score you're looking to get, and by no means am I claiming that this will optimize your score. However, this will help you get rid of those errors so you can get back to work.
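If the end goal is the learning curve you mention, one possible approach is to repeat the sampling step with a growing number of rows per label and record the mean score each time. This is only a sketch under my own assumptions: the per-label sizes (2, 3, 4), n_splits=5, the fixed seeds, and the modern sklearn.model_selection API are all illustrative choices, and on random data the scores themselves mean nothing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(12, 5))
df['label'] = np.repeat([1, 2, 3], 4)  # 4 rows for each of the 3 labels

clf = KNeighborsClassifier(n_neighbors=2)
curve = {}  # maps total number of samples to mean CV score

for n_per_label in (2, 3, 4):
    # Stratified subsample: the same number of random rows from each label
    subset = pd.concat(
        g.sample(n_per_label, random_state=0) for _, g in df.groupby('label')
    )
    X = subset.drop(columns='label').values
    y = subset['label'].values

    sss = StratifiedShuffleSplit(n_splits=5, test_size=0.35, random_state=0)
    curve[len(subset)] = cross_val_score(clf, X, y, cv=sss).mean()

for n_samples, score in curve.items():
    print(n_samples, score)
```

Plotting the keys of curve against its values would then give the curve.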
