Reputation: 481
I am looking for the best way to draw a stratified random sample, as is done for surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit because I am not doing supervised learning and have no target variable. I just want to create stratified random samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).
Python is my main language.
Thank you for any help
Upvotes: 16
Views: 35480
Reputation: 1
You could do this without scikit-learn using a function similar to this:
import pandas as pd
import numpy as np
def stratified_sampling(df, strata_col, sample_size):
    # sample_size is the fraction of rows drawn from each stratum
    groups = df.groupby(strata_col)
    samples = []
    for _, group in groups:
        stratum_sample = group.sample(frac=sample_size, replace=False, random_state=7)
        samples.append(stratum_sample)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    return pd.concat(samples)
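In the above, df is the DataFrame to sample from, strata_col is the column that defines the strata, and sample_size is the fraction of rows to draw from each stratum.
You could then call stratified_sampling as follows: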
sample = stratified_sampling(df_to_be_sampled, 'gender', 0.2)
This returns a new DataFrame, sample, containing the randomly sampled rows. Note that I've chosen random_state=7 for testing and reproducibility, but the value is arbitrary.
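To check that drawing the same fraction from every stratum keeps the strata proportions intact, here is a minimal sketch on toy data. The helper name sample_per_stratum, its random_state parameter and the toy DataFrame are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

def sample_per_stratum(df, strata_col, frac, random_state=None):
    """Draw the same fraction of rows from every stratum of `strata_col`."""
    parts = [
        group.sample(frac=frac, replace=False, random_state=random_state)
        for _, group in df.groupby(strata_col)
    ]
    return pd.concat(parts)

# Hypothetical toy data: 600 rows labelled 'F' and 400 labelled 'M'.
df_to_be_sampled = pd.DataFrame({
    "gender": ["F"] * 600 + ["M"] * 400,
    "score": range(1000),
})

sample = sample_per_stratum(df_to_be_sampled, "gender", 0.2, random_state=7)

# Each stratum contributes 20% of its rows, so the 60/40 split is preserved.
counts = sample["gender"].value_counts().to_dict()
print(counts)
```

Exposing random_state as a parameter lets the caller choose between a reproducible draw and a fresh one on every call.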
Upvotes: -1
Reputation: 117
Given that the variables are binned, the following one-liner should give you the desired output. scikit-learn is mainly used for purposes other than yours, but borrowing a function from it should do no harm.
Note that if your scikit-learn version is earlier than 0.19.0, the sampling result might contain duplicate rows.
If you test the following method, please share whether it behaves as expected or not.
from sklearn.model_selection import train_test_split
stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])
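One way to test whether this behaves as expected is to look for duplicate rows and compare the stratum shares of the sample with those of the population. The sketch below assumes a reasonably recent scikit-learn and uses a synthetic population. The seed, the population size of 20,000 and test_size=0.99 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, already-binned population (hypothetical data).
rng = np.random.default_rng(0)
n = 20_000
population = pd.DataFrame({
    "income": rng.integers(0, 3, n),
    "sex": rng.integers(0, 2, n),
    "age": rng.integers(0, 4, n),
})
keys = ["income", "sex", "age"]

# Keep 1% of the rows as the stratified sample and discard the rest.
stratified_sample, _ = train_test_split(
    population,
    test_size=0.99,
    stratify=population[keys],
    random_state=0,
)

# Check 1: no row was drawn twice.
has_duplicates = stratified_sample.index.duplicated().any()

# Check 2: each stratum's share in the sample is close to its population share.
pop_share = population.groupby(keys).size() / len(population)
sample_share = stratified_sample.groupby(keys).size() / len(stratified_sample)
max_gap = (pop_share - sample_share.reindex(pop_share.index, fill_value=0)).abs().max()

print(len(stratified_sample), has_duplicates, max_gap)
```

Note that train_test_split raises an error if any stratum has fewer than two rows, so very rare combinations need to be merged or dropped first.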
Upvotes: 10
Reputation: 481
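This is my best solution so far. It is important to bin continuous variables beforehand and to have a minimum number of observations in each stratum.
In this example, I am generating a random population of 100K observations, drawing a simple random sample of 100 observations, and drawing a stratified random sample of 100 observations.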
When comparing both samples, the stratified one is much more representative of the overall population.
If anyone has an idea of a more optimal way to do it, please feel free to share.
import pandas as pd
import numpy as np
# Generate random population (100K)
population = pd.DataFrame(index=range(0,100000))
population['income'] = 0
# Assign through .loc; chained population['income'].iloc[...] = ... may not modify the DataFrame
population.loc[39000:79999, 'income'] = 1
population.loc[80000:, 'income'] = 2
population['sex'] = np.random.randint(0,2,100000)
population['age'] = np.random.randint(0,4,100000)
pop_count = population.groupby(['income', 'sex', 'age'])['income'].count()
# Random sampling (100 observations out of 100k)
random_sample = population.iloc[
    np.random.randint(
        0,
        len(population),
        int(len(population) / 1000)
    )
]
# Random Stratified Sampling (100 observations out of 100k)
stratified_sample = list(map(lambda x : population[
    (
        population['income'] == pop_count.index[x][0]
    )
    &
    (
        population['sex'] == pop_count.index[x][1]
    )
    &
    (
        population['age'] == pop_count.index[x][2]
    )
].sample(frac=0.001), range(len(pop_count))))
stratified_sample = pd.concat(stratified_sample)
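A possibly shorter alternative, assuming pandas 1.1 or later, is to bin the continuous variables with pd.qcut / pd.cut and then call sample directly on the groupby object. This is only a sketch: the synthetic population, the bin edges, the column names income_bin and age_bin, and the seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic population with continuous income and age (hypothetical data).
rng = np.random.default_rng(42)
n = 100_000
population = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),
    "sex": rng.integers(0, 2, n),
    "age": rng.integers(18, 90, n),  # ages 18..89
})

# Bin the continuous variables first: terciles for income, fixed edges for age.
population["income_bin"] = pd.qcut(population["income"], q=3, labels=False)
population["age_bin"] = pd.cut(
    population["age"], bins=[17, 30, 45, 65, 90], labels=False
)

strata = ["income_bin", "sex", "age_bin"]
frac = 0.001

# A stratum much smaller than 1 / frac rows may contribute zero rows after
# rounding, so inspect the stratum sizes before sampling.
stratum_sizes = population.groupby(strata).size()
smallest_stratum = int(stratum_sizes.min())

# Draw the same fraction from every stratum in a single call.
stratified_sample = population.groupby(strata).sample(frac=frac, random_state=42)

n_strata_in_sample = stratified_sample.groupby(strata).ngroups
print(len(stratified_sample), smallest_stratum, n_strata_in_sample)
```

Because each stratum's size is rounded separately, the total sample size is only approximately frac times the population size.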
Upvotes: 4