NicolaiF

Reputation: 1343

Stratified sampling in Python, with constraint

I have a data frame with observations:

import pandas as pd

data = [['red', 1, 0.2], ['blue', 1, 0.5], ['green', 2, 0.8], ['blue', 2, 0.55], ['blue', 2, 0.52], ['red', 3, 0.15], ['green', 3, 0.85], ['red', 4, 0.12], ['purple', 4, 0.01]]

df = pd.DataFrame(data, columns=['label', 'signal', 'value'])

    label  signal  value
0     red       1   0.20
1    blue       1   0.50
2   green       2   0.80
3    blue       2   0.55
4    blue       2   0.52
5     red       3   0.15
6   green       3   0.85
7     red       4   0.12
8  purple       4   0.01

I want to do stratified k-fold sampling over the labels, but in such a way that no signal value is split across folds. I have done it with an implementation that relies on dictionaries and complicated checks. Is there an easier way to go about this problem?

The result for K=2 could be:

batch 1
0     red       1   0.20
1    blue       1   0.50
5     red       3   0.15
6   green       3   0.85

batch 2
2   green       2   0.80
3    blue       2   0.55
4    blue       2   0.52
7     red       4   0.12
8  purple       4   0.01

Here there are 2 reds, 1 blue, and 1 green in batch 1, and 1 red, 2 blues, 1 green, and 1 purple in batch 2. The two batches are somewhat balanced with regard to their class contents, which is what I want.

Upvotes: 0

Views: 496

Answers (1)

Magellan88

Reputation: 2573

I think you are looking for the GroupShuffleSplit class that is built into scikit-learn: sklearn.model_selection.GroupShuffleSplit

Upvotes: 1
