Pandas stratified sampling based on multiple columns

Question

I have a pandas dataframe that looks like this:

| Cliid | Segment | Insert |
|-------|---------|--------|
| 001   | A       | 0      |
| 002   | A       | 0      |
| 003   | C       | 0      |
| 004   | B       | 1      |
| 005   | A       | 0      |
| 006   | B       | 0      |

I want to split it into two groups so that each group has the same composition for each variable in [Segment, Insert]. For example, in each group 1/2 of the observations would belong to Segment A, 1/6 would have Insert = 1, and so on.

I've checked this answer, but it only stratifies on one variable, so it won't work for more than one.

R has this function that does exactly that, but using R is not an option.

By the way, I'm using Python 3.
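
To make the goal concrete, here is a minimal sketch of one possible approach using only pandas: treat every (Segment, Insert) combination as a stratum, shuffle the rows, and deal the rows of each stratum round-robin into the groups. The helper name `stratified_split`, its parameters, and the randomly generated demo data are assumptions made for illustration; they are not part of the original dataframe or of any library API.

```python
import numpy as np
import pandas as pd


def stratified_split(df, strata_cols, n_groups=2, seed=0):
    """Hypothetical helper: return group labels (0..n_groups-1) aligned with df.index."""
    # Shuffle the rows so the assignment inside each stratum is random.
    shuffled = df.sample(frac=1, random_state=seed)
    # Number the rows within each combination of the strata columns: 0, 1, 2, ...
    position = shuffled.groupby(strata_cols).cumcount()
    # Deal the rows round-robin into the groups, then restore the original row order.
    return (position % n_groups).reindex(df.index).rename("group")


# Made-up demo data with the same columns as the table in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Cliid": np.arange(1, 1001),
    "Segment": rng.choice(["A", "B", "C"], size=1000, p=[0.5, 0.3, 0.2]),
    "Insert": rng.choice([0, 1], size=1000, p=[5 / 6, 1 / 6]),
})

df["group"] = stratified_split(df, ["Segment", "Insert"])

# Rows per (Segment, Insert) stratum in each group; the two columns should match closely.
print(pd.crosstab([df["Segment"], df["Insert"]], df["group"]))
```

Because the split is made on the joint strata, the marginal shares of Segment and of Insert come out balanced as well. One caveat of this sketch: a stratum with an odd number of rows always gives its extra row to group 0, so the group sizes can differ by up to the number of strata, and a stratum with a single row (as in the six-row example above) cannot be divided at all.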
