Reputation: 173
I have a pandas dataframe that looks like this:
| Cliid | Segment | Insert |
|-------|---------|--------|
| 001 | A | 0 |
| 002 | A | 0 |
| 003 | C | 0 |
| 004 | B | 1 |
| 005 | A | 0 |
| 006 | B | 0 |
I want to split it into 2 groups so that each group has the same composition of each variable in [Segment, Insert]. For example, each group would have 1/2 of its observations in segment A, 1/6 with Insert = 1, and so on.
I've checked this answer, but it only stratifies on one variable and won't work for more than one.
R has this function that does exactly that, but using R is not an option.
By the way, I'm using Python 3.
Upvotes: 11
Views: 13376
Reputation: 1015
You can use sklearn's train_test_split function with its stratify parameter,
which takes the columns to stratify on.
For example:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df[["Segment", "Insert"]])
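To get the two equal groups asked for in the question, the same call can be used with test_size=0.5. Below is a minimal sketch on made-up data (the 60-row frame, the random_state value, and the group_1/group_2 names are assumptions for illustration, not from the question). Note that every (Segment, Insert) combination must occur at least twice, so the 6-row sample shown in the question would raise a ValueError as-is.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 60 rows, each (Segment, Insert) combination occurs at least twice
df = pd.DataFrame({
    "Cliid": range(1, 61),
    "Segment": ["A"] * 30 + ["B"] * 20 + ["C"] * 10,
    "Insert": [0] * 24 + [1] * 6 + [0] * 16 + [1] * 4 + [0] * 8 + [1] * 2,
})

# Stratify on the joint (Segment, Insert) combination and split in half
group_1, group_2 = train_test_split(
    df,
    test_size=0.5,
    stratify=df[["Segment", "Insert"]],
    random_state=0,  # arbitrary seed, only for reproducibility
)

# Compare the composition of the two groups
counts_1 = group_1.groupby(["Segment", "Insert"]).size()
counts_2 = group_2.groupby(["Segment", "Insert"]).size()
print(pd.concat({"group_1": counts_1, "group_2": counts_2}, axis=1))
```

Because the split is stratified on the combination of both columns, the proportions of Segment and of Insert taken separately are preserved as well. With real data where the counts are odd or small, the groups will match only approximately.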
Upvotes: 13