arthur

Reputation: 173

Pandas stratified sampling based on multiple columns

I have a pandas dataframe that looks like this:

| Cliid | Segment | Insert |
|-------|---------|--------|
| 001   | A       | 0      |
| 002   | A       | 0      |
| 003   | C       | 0      |
| 004   | B       | 1      |
| 005   | A       | 0      |
| 006   | B       | 0      |

I want to split it into 2 groups in a way that each group has the same composition of each variable in [Segment, Insert]. For example, each group would have 1/2 of the observations belonging to segment A, 1/6 of Insert = 1, and so on.

I've checked this answer, but it only stratifies on one variable and won't work for more than one.

R has this function that does exactly that, but using R is not an option.

By the way, I'm using Python 3.

Upvotes: 11

Views: 13376

Answers (1)

Jannik

Reputation: 1015

You can use sklearn's train_test_split function with its stratify parameter, which takes the column or columns to stratify on.

For example:

from sklearn.model_selection import train_test_split

# Split the same DataFrame that the stratify columns are taken from
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df[["Segment", "Insert"]])
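To make this concrete for the two-group case from the question, here is a minimal sketch under a few assumptions. The sample data is made up for illustration, the names `strata`, `group_a` and `group_b` are hypothetical, and the two columns are joined into a single key per row, which is a variation on passing both columns directly. Note that every (Segment, Insert) combination needs at least two rows, otherwise train_test_split raises a ValueError (the six-row table in the question would fail for that reason).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample data: 24 rows, and every (Segment, Insert)
# combination occurs an even number of times.
df = pd.DataFrame({
    "Cliid": [f"{i:03d}" for i in range(1, 25)],
    "Segment": ["A"] * 12 + ["B"] * 8 + ["C"] * 4,
    "Insert": [0] * 10 + [1] * 2 + [0] * 6 + [1] * 2 + [0] * 4,
})

# One label per row that encodes both columns, e.g. "A_0".
strata = df["Segment"].astype(str) + "_" + df["Insert"].astype(str)

# test_size=0.5 yields two groups of equal size; random_state makes
# the shuffle reproducible.
group_a, group_b = train_test_split(
    df, test_size=0.5, stratify=strata, random_state=0
)

# Compare the composition of the two groups.
print(group_a["Segment"].value_counts(normalize=True).sort_index())
print(group_b["Segment"].value_counts(normalize=True).sort_index())
print(group_a["Insert"].mean(), group_b["Insert"].mean())
```

Building the key explicitly also makes it easy to run `strata.value_counts()` beforehand and spot combinations that are too rare to split. For an 80/20 split, set test_size=0.2 instead.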

Upvotes: 13
