ANDRES MENDEZ

Reputation: 11

How to divide a large data set into n subsets while maintaining the class proportion

I am working on a binary classification problem with a large data set (the number of samples is large, the number of features not so much). The data is imbalanced, but I am using a weight array that fixes this issue (sort of).

I've been trying some classifiers with sklearn on a small version of this set, and apparently an SVM works well for what I want. However, once I try to fit an SVM on the whole data set, it takes forever (and I also run out of memory).

What I want to know is whether there is a fast way in sklearn to divide this set into, say, 10 subsets while maintaining the class proportions, so that I can then split each subset into training/testing sets and fit an SVM independently on each one (which would also let me use different processors).

Upvotes: 0

Views: 1262

Answers (2)

Venkatachalam

Reputation: 16966

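The StratifiedKFold class can serve your requirement. It splits the data into k stratified folds. Call _iter_test_masks() or _make_test_folds() on it to get the fold membership of each sample. Note that both are private helpers (leading underscore), so they may change between versions; the public split(X, y) method exposes the folds as train/test index arrays.

Based on the documentation: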

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = StratifiedKFold(n_splits=2, shuffle=True)
>>> for i in skf._iter_test_masks(X, y):
...     print(i)

[ True False False  True]
[False  True  True False]

>>> for i in skf._make_test_folds(X, y):
...     print(i)

1
0
0
1
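
As a rough sketch of how the same idea could look with the public split() method and the 10 subsets from the question: the test indices of the folds are disjoint and together cover every sample, so they can be used directly as stratified subsets. The synthetic data, the 0.25 test size, and the names subsets and subset_splits are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic imbalanced data (illustrative only): 900 negatives, 100 positives.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Keep only the test indices of each fold: 10 disjoint subsets,
# each with (approximately) the overall class ratio.
subsets = [test_idx for _, test_idx in skf.split(X, y)]

# Split every subset into its own stratified training/testing part.
subset_splits = []
for idx in subsets:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[idx], y[idx], test_size=0.25, stratify=y[idx], random_state=0)
    subset_splits.append((X_tr, X_te, y_tr, y_te))

# Fraction of positives per subset, to check the stratification.
positive_rates = [float(y[idx].mean()) for idx in subsets]
print(positive_rates)
```

Each entry of subset_splits is independent of the others, so a classifier could then be fitted on each one separately, for instance in separate processes.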

Upvotes: 1

Franco Piccolo

Reputation: 7410

You could add a new column holding a random number between 0 and 1, generated with np.random.random_sample, then group by the class and apply pd.cut to that random number to create a new column, dataset, with the subset label:

import numpy as np
import pandas as pd

df = pd.DataFrame({'class': np.random.choice(['A', 'B'], 100),
                   'value': np.random.random_sample(100)})
# Within each class, bin the random value into 10 equal-width intervals (labels 0-9).
df['dataset'] = df.groupby('class', group_keys=False)['value'].apply(
    lambda s: pd.cut(s, 10, labels=False))
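
Because pd.cut uses equal-width bins on a random value, the subsets above end up only roughly equal in size. A possible variation (not part of the answer, and only a sketch) is to shuffle the rows and deal them round-robin within each class, which keeps the per-class counts of any two subsets within one row of each other. The 80/20 class mix and the name n_subsets are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative imbalanced data: roughly 80% class A, 20% class B.
rng = np.random.RandomState(0)
df = pd.DataFrame({'class': rng.choice(['A', 'B'], 100, p=[0.8, 0.2])})

n_subsets = 10

# Shuffle the rows, number them 0, 1, 2, ... within each class,
# and deal them round-robin into the subsets.
shuffled = df.sample(frac=1, random_state=0)
df['dataset'] = shuffled.groupby('class').cumcount() % n_subsets

# Rows per class in each subset, to check the proportions.
counts = pd.crosstab(df['dataset'], df['class'])
print(counts)
```

The assignment back to df relies on pandas aligning the result on the original row index, so the shuffled order does not leak into the data frame itself.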

Upvotes: 1
