asl

Reputation: 481

How to do a random stratified sampling with Python (Not a train/test split)?

I am looking for the best way to draw a random stratified sample, as is done for surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit, since I am not doing supervised learning and I have no target. I just want to create random stratified samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).

Python is my main language.

Thank you for any help

Upvotes: 16

Views: 35480

Answers (3)

yosemite sam

Reputation: 1

You could do this without scikit-learn using a function similar to this:

import pandas as pd
import numpy as np

def stratified_sampling(df, strata_col, sample_size):
    # Sample the same fraction from every stratum, without replacement.
    samples = [
        group.sample(frac=sample_size, replace=False, random_state=7)
        for _, group in df.groupby(strata_col)
    ]
    # DataFrame.append was removed in pandas 2.0, so concatenate instead.
    return pd.concat(samples)

In the above:

  • df is the DataFrame to be sampled
  • strata_col is the column representing the strata of interest (e.g. 'gender')
  • sample_size is the fraction of each stratum to sample (e.g. 0.2 for 20% of the data)

You could then call stratified_sampling as follows:

sample = stratified_sampling(df_to_be_sampled, 'gender', 0.2)

This will return a new DataFrame called sample containing the randomly sampled data. Note I've chosen random_state=7 for testing and reproducibility but this is of course arbitrary.
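
For what it's worth, pandas 1.1 and later expose sample directly on a groupby object, so the explicit loop can probably be dropped. A minimal sketch, assuming a toy DataFrame with a hypothetical 'gender' column (neither is from the question):

```python
import pandas as pd

# Hypothetical toy data: 80 rows in one stratum, 20 in the other.
df = pd.DataFrame({
    "gender": ["F"] * 80 + ["M"] * 20,
    "score": range(100),
})

# Draw 20% from every stratum without replacement.
# random_state is optional and only there for reproducibility.
sample = df.groupby("gender").sample(frac=0.2, random_state=7)

print(sample["gender"].value_counts().to_dict())  # {'F': 16, 'M': 4}
```

Each stratum keeps its share of the data, because the same fraction is taken from every group.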

Upvotes: -1

Furkan Gursoy

Reputation: 117

Given that the variables are binned, the following one-liner should give you the desired output. scikit-learn is mainly used for purposes other than yours, but borrowing a function from it should not do any harm.

Note that if you have a scikit-learn version earlier than 0.19.0, the sampling result might contain duplicate rows.

If you test the following method, please share whether it behaves as expected or not.

from sklearn.model_selection import train_test_split

stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])
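
If you would rather ask for an absolute number of rows than a fraction, train_test_split also accepts an integer train_size. A rough sketch, assuming a small made-up population (the data, the seeds and the size of 100 are my own choices for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical population with three already-binned variables.
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "income": rng.integers(0, 3, 10_000),
    "sex": rng.integers(0, 2, 10_000),
    "age": rng.integers(0, 4, 10_000),
})

# An integer train_size requests exactly that many rows;
# the remainder lands in the discarded "test" part.
stratified_sample, _ = train_test_split(
    population,
    train_size=100,
    stratify=population[["income", "sex", "age"]],
    random_state=0,
)

print(len(stratified_sample))
```

The requested size has to be at least the number of distinct strata (24 here), otherwise scikit-learn raises an error.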

Upvotes: 10

asl

Reputation: 481

This is my best solution so far. It is important to bin continuous variables beforehand and to have a minimum number of observations in each stratum.

In this example, I am:
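
To illustrate the binning and the minimum-count check, here is a small sketch. The age_years column, the bin edges and the threshold of 30 are all assumptions made up for this illustration and are not used in the example further down:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical continuous variable.
people = pd.DataFrame({"age_years": rng.uniform(18, 90, 10_000)})

# Bin into four labelled groups; right=False makes the intervals
# [18, 30), [30, 45), [45, 65), [65, 90).
people["age_group"] = pd.cut(
    people["age_years"],
    bins=[18, 30, 45, 65, 90],
    labels=["18-29", "30-44", "45-64", "65+"],
    right=False,
)

# Check that every stratum has enough rows to sample from.
stratum_sizes = people["age_group"].value_counts()
min_per_stratum = 30  # hypothetical threshold
too_small = stratum_sizes[stratum_sizes < min_per_stratum]
print(stratum_sizes.sort_index())
```

If too_small is not empty, merging neighbouring bins is one way to get every stratum above the threshold.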

  1. Generating a population
  2. Sampling in a pure random way
  3. Sampling in a random stratified way

When comparing both samples, the stratified one is much more representative of the overall population.

If anyone has an idea for a better way to do it, please feel free to share.


import pandas as pd
import numpy as np

# Generate random population (100K)

population = pd.DataFrame(index=range(0,100000))
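# Income brackets: 39k rows in bracket 0, 41k in bracket 1, 20k in bracket 2.
# Built in one step to avoid chained assignment (population['income'].iloc[...] = ...),
# which does not modify the frame under pandas copy-on-write.
population['income'] = np.repeat([0, 1, 2], [39000, 41000, 20000])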
population['sex'] = np.random.randint(0,2,100000)
population['age'] = np.random.randint(0,4,100000)

pop_count = population.groupby(['income', 'sex', 'age'])['income'].count()

# Random sampling (100 observations out of 100k)

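# DataFrame.sample draws without replacement by default, so no row appears twice
# (np.random.randint could pick the same position more than once).
random_sample = population.sample(n=len(population) // 1000)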

# Random Stratified Sampling (about 100 observations out of 100k; each stratum's count is rounded)

stratified_sample = list(map(lambda x : population[
    (
        population['income'] == pop_count.index[x][0]
    ) 
    &
    (
        population['sex'] == pop_count.index[x][1]
    )
    &
    (
        population['age'] == pop_count.index[x][2]
    )
].sample(frac=0.001), range(len(pop_count))))

stratified_sample = pd.concat(stratified_sample)
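
One possible way to put a number on "more representative" is to compare stratum shares in each sample with those in the population. The sketch below rebuilds the population so it runs on its own; the max_share_gap helper and the seeds are hypothetical, and groupby(...).sample (pandas 1.1+) is swapped in for the map/lambda above as a shorter equivalent:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000
strata = ["income", "sex", "age"]

# Same shape as the population above, rebuilt so the snippet is self-contained.
population = pd.DataFrame({
    "income": np.repeat([0, 1, 2], [39_000, 41_000, 20_000]),
    "sex": rng.integers(0, 2, n),
    "age": rng.integers(0, 4, n),
})

random_sample = population.sample(n=100, random_state=1)
stratified_sample = population.groupby(strata).sample(frac=0.001, random_state=1)

def max_share_gap(sample):
    """Largest absolute gap between sample and population stratum shares."""
    pop_share = population.groupby(strata).size() / len(population)
    samp_share = sample.groupby(strata).size() / len(sample)
    # Strata missing from the sample count as a share of 0.
    samp_share = samp_share.reindex(pop_share.index, fill_value=0)
    return (samp_share - pop_share).abs().max()

print("random:    ", max_share_gap(random_sample))
print("stratified:", max_share_gap(stratified_sample))
```

A smaller gap means the sample's mix of strata is closer to the population's.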

Upvotes: 4
