Qubix

Reputation: 4353

Sample pandas dataframe by column value

I have a pandas dataframe, named ratings_full, of the form:

userID   nr_votes
123      12
124      14
234      22
346      35
763      45
238      1
127      17

I want to sample this dataframe, as it contains tens of thousands of users. I want to extract 100 users, but to somehow prioritize the ones with a lower value of nr_votes, without sampling only them. So a kind of "stratified sampling" on nr_votes. Is it possible?

This is all I managed so far:

SAMPLING_FRACTION = 0.0007

uid_samples = ratings_top['userId'] \
                        .drop_duplicates() \
                        .sample(frac=SAMPLING_FRACTION, 
                                replace=False, 
                                random_state=1)
ratings_sample = pd.merge(ratings_full, uid_samples, on='userId', how='inner')

This only samples user IDs uniformly at random; it does nothing to make the sample stratified on nr_votes.

EDIT: I would even be happy if we can split the nr_votes into N buckets and we perform stratified sampling on the buckets.

EDIT 2: I am now trying this:

from sklearn.model_selection import train_test_split
# The arrays are passed positionally; stratify needs the actual label Series
X_train, X_test, y_train, y_test = train_test_split(
             ratings_full.drop(['nr_votes'], axis=1),
             ratings_full.nr_votes,
             test_size=0.33,
             random_state=42,
             stratify=ratings_full.nr_votes)

Then I have to put the dataframes back together. It's not an ideal answer but it may work. I will even try to bucket first and use the bucket column as my "labels".
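
The bucketing idea could be sketched with pandas alone, as below. This is only an illustration under assumptions: the synthetic ratings_full, the N_BUCKETS and PER_BUCKET values, and the choice of quantile buckets via pd.qcut are all made up for the example and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ratings_full: 10,000 users with skewed vote counts.
rng = np.random.default_rng(0)
ratings_full = pd.DataFrame({
    "userID": np.arange(10_000),
    "nr_votes": rng.geometric(p=0.05, size=10_000),
})

N_BUCKETS = 5     # assumed number of buckets
PER_BUCKET = 20   # 5 buckets * 20 users = 100 users

# Quantile-based buckets; duplicates="drop" merges buckets whose edges coincide.
buckets = pd.qcut(ratings_full["nr_votes"], q=N_BUCKETS,
                  labels=False, duplicates="drop")

# Draw the same number of users from every bucket.
ratings_sample = (
    ratings_full.assign(bucket=buckets)
    .groupby("bucket")
    .sample(n=PER_BUCKET, random_state=1)
)
```

Every bucket contributes the same number of users here. To favor users with few votes, one option would be to draw a larger n from the low buckets than from the high ones.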

Upvotes: 2

Views: 1100

Answers (2)

Mahsa Hassankashi

Reputation: 2139

from sklearn.model_selection import StratifiedShuffleSplit

# X and y are assumed to be the feature frame and the labels to stratify on
n_splits = 1
sss = StratifiedShuffleSplit(n_splits=n_splits,
                             test_size=0.1,
                             random_state=42)
train_idx, test_idx = next(sss.split(X, y))

Upvotes: 0

BENY

Reputation: 323226

We can use np.random.choice on the index with per-row probabilities, then select the chosen rows

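import numpy as np
n = len(ratings_top)
# 'probability' must hold per-row weights that sum to 1; size must be an integer
idx = np.random.choice(ratings_top.index.values, p=ratings_top['probability'], size=int(n * 0.0007), replace=True)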

Then

sample_df = ratings_top.loc[idx].copy()
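
The 'probability' column is not defined anywhere above, so here is one possible way to build it. This is a minimal sketch, assuming that weights inversely proportional to nr_votes are what is wanted; that weighting, the sample size of 3, and the use of a NumPy Generator are assumptions for illustration. The toy frame copies the table from the question.

```python
import numpy as np
import pandas as pd

# Toy data copied from the table in the question.
ratings_top = pd.DataFrame({
    "userID": [123, 124, 234, 346, 763, 238, 127],
    "nr_votes": [12, 14, 22, 35, 45, 1, 17],
})

# Assumed weighting: inverse of nr_votes, normalized so the weights sum to 1.
weights = 1.0 / ratings_top["nr_votes"]
ratings_top["probability"] = weights / weights.sum()

# Draw 3 distinct rows; users with fewer votes are more likely to be picked.
rng = np.random.default_rng(1)
idx = rng.choice(ratings_top.index.to_numpy(),
                 size=3,
                 replace=False,
                 p=ratings_top["probability"].to_numpy())
sample_df = ratings_top.loc[idx].copy()
```

With replace=False no user is drawn twice. pandas also offers ratings_top.sample(n=3, weights=weights, random_state=1), which normalizes the weights itself.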

Upvotes: 1
