Reputation: 3318
I want to plot the effect of removing samples (rows); some people call this a "learning curve".
So I thought of using Pandas to remove some rows. How do I randomly remove rows from a dataframe, but take them from each label?
But when I try to do cross-validation, I get the following error (even after using df.values to turn the dataframe into an array):
So, what am I doing wrong?
Here is my code:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import neighbors
from sklearn import cross_validation
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))
X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values
print(X)
print(y)
clf = neighbors.KNeighborsClassifier()
sss = StratifiedShuffleSplit(1, test_size=0.1)
scoresSSS = cross_validation.cross_val_score(clf, X, y, cv=sss)
print(scoresSSS)
Upvotes: 0
Views: 2869
Reputation: 1447
Right off the bat: with sss = StratifiedShuffleSplit(1, test_size=0.1) you're generating a splitter object, not an iterable of train/test splits:
>>> type(sss)
<class 'sklearn.model_selection._split.StratifiedShuffleSplit'>
Instead of giving cross_val_score your entire StratifiedShuffleSplit object (which obviously isn't iterable, hence the error), you need to give it the train/test output of the object's .split() method (docs).
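For a quick illustration of what .split() returns (assuming a splitter with a workable test_size, which is covered next), each split is a pair of NumPy index arrays, one for the training rows and one for the test rows:
>>> splits = StratifiedShuffleSplit(n_splits=1, test_size=0.35).split(X, y)
>>> for train_idx, test_idx in splits:
...     print(train_idx, test_idx)  # two arrays of row indices; exact values vary per run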
Further, the test_size param in your StratifiedShuffleSplit is too small. Using 0.1 as you have will throw a ValueError, because a stratified split needs at least one test sample per class and you have 3 unique classes, so a test size of 0.1 on this small dataset won't do. And lastly, you're using the default n_neighbors param value in your KNeighbors clf object. That default (5) is too large when using such a small data set and will throw another ValueError, since n_neighbors must be <= the number of training samples left after the split. So in my example below I've upped the test size in your StratifiedShuffleSplit object, dropped n_neighbors down to 2, and passed the iterable from sss.split(X, y) to cross_validation.cross_val_score's cv param.
So here is what you want your code to look like:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn import neighbors
from sklearn import cross_validation
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
df['label'] = label
df1 = pd.concat(g.sample(2) for idx, g in df.groupby('label'))
X = df1[[0, 1, 2, 3, 4]].values
y = df1.label.values
clf = neighbors.KNeighborsClassifier(n_neighbors=2)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35)
scoresSSS = cross_validation.cross_val_score(clf, X, y, cv=sss.split(X, y))
print(scoresSSS)
Let me just say that I have no idea what score you're looking to get, and by no means am I claiming that this will optimize your score. However, this will help you get rid of those errors so you can get back to work.
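And since your end goal is a learning curve, here is a rough sketch of one way to extend this (just an illustration of the idea, reusing your groupby/sample trick to keep n rows per label and recording the mean score at each size; the sizes themselves are placeholders):
sample_sizes = [2, 3, 4]  # rows kept per label
mean_scores = []
for n in sample_sizes:
    # keep n random rows from each label, as in your original code
    sub = pd.concat(g.sample(n) for idx, g in df.groupby('label'))
    X_sub = sub[[0, 1, 2, 3, 4]].values
    y_sub = sub.label.values
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.35)
    scores = cross_validation.cross_val_score(clf, X_sub, y_sub, cv=sss.split(X_sub, y_sub))
    mean_scores.append(scores.mean())
print(list(zip(sample_sizes, mean_scores)))
Plot sample_sizes against mean_scores and you have your learning curve.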
Upvotes: 2