Reputation: 11
I am working on a binary classification problem with a large data set (the number of samples is large, the number of features not so much). The data is imbalanced, but I am using a weight array that (sort of) fixes this issue.
I've been trying some classifiers with sklearn on a small version of this set, and apparently SVM works well for what I want. However, once I try to fit an SVM on the whole data set, it takes forever (and I also run out of memory).
What I want to know is whether there is any fast way in sklearn to divide this set into, say, 10 subsets while maintaining the proportion of the classes, so that I can then split each subset into training/testing and fit an SVM independently on each one (and use different processors too)?
Upvotes: 0
Views: 1262
Reputation: 16966
StratifiedKFold can serve your requirement: it splits the data into k stratified folds. You can call its _iter_test_masks() or _make_test_folds() methods (note that these are private) to get the fold assignments directly.
Based on the documentation:
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = StratifiedKFold(n_splits=2, shuffle=True)
>>> for i in skf._iter_test_masks(X, y):
...     print(i)
[ True False False True]
[False True True False]
>>> for i in skf._make_test_folds(X, y):
... print(i)
1
0
0
1
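If you would rather avoid the private methods, the public split() method yields index arrays, and the ten test folds together form exactly the stratified partition you asked about. A minimal sketch (assuming X and y are NumPy arrays as above; SVC(class_weight='balanced') is just an illustrative estimator, not necessarily the weighting you already use):
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# 10 disjoint, stratified subsets: the test fold of each split keeps the class ratio
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = []
for _, subset_idx in skf.split(X, y):
    X_sub, y_sub = X[subset_idx], y[subset_idx]
    clf = SVC(class_weight='balanced')  # illustrative choice of weighting
    clf.fit(X_sub, y_sub)
    models.append(clf)
Each per-subset fit is independent, so the loop body could be farmed out to separate processes, for example with joblib.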
Upvotes: 1
Reputation: 7410
You could add a new column containing a random number from 0 to 1 generated with np.random.random_sample, then group by the class and apply pd.cut to that random number, creating a new dataset column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'class': np.random.choice(['A', 'B'], 100),
                   'value': np.random.random_sample(100)})
df['dataset'] = pd.DataFrame(df.groupby('class').apply(lambda x:
    pd.cut(x['value'], 10, labels=range(0, 10)))).reset_index(0, drop=True)
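Each row then carries a dataset label from 0 to 9, assigned per class, so one way to pull out a single subset afterwards (a quick sketch on the df above):
# pick out, say, subset 3; class proportions are preserved (roughly)
subset = df[df['dataset'] == 3]
print(subset['class'].value_counts(normalize=True))
Since the random values are uniform, pd.cut's equal-width bins give only roughly equal subset sizes; pd.qcut on the same column would make the per-class counts essentially equal.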
Upvotes: 1