Reputation: 4353
I have a pandas dataframe, named ratings_full, of the form:
userID nr_votes
123 12
124 14
234 22
346 35
763 45
238 1
127 17
I want to sample this dataframe, as it contains tens of thousands of users. I want to extract 100 users, but somehow prioritize the ones with a lower value of nr_votes, without sampling only them. So, a kind of "stratified sampling" on nr_votes. Is it possible?
This is all I managed so far:
import pandas as pd

SAMPLING_FRACTION = 0.0007

# Sample a fraction of the distinct user IDs, then keep only their rows
uid_samples = ratings_full['userID'] \
    .drop_duplicates() \
    .sample(frac=SAMPLING_FRACTION,
            replace=False,
            random_state=1)
ratings_sample = pd.merge(ratings_full, uid_samples, on='userID', how='inner')
This only provides a random sampling across userIDs, but no way to make sure the sampling is somehow stratified.
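For what it's worth, pandas' own sample accepts a weights argument, which is one direct way to express "prioritize lower nr_votes"; a minimal sketch follows, where the inverse-vote weighting is an assumption rather than anything established here:
# Hypothetical weighting: fewer votes -> larger weight.
# pandas normalizes the weights internally, so they need not sum to 1.
weights = 1.0 / ratings_full['nr_votes'].clip(lower=1)
ratings_sample = ratings_full.sample(n=100, weights=weights, random_state=1)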
EDIT: I would even be happy if we split nr_votes into N buckets and perform stratified sampling on the buckets.
EDIT 2: I am now trying this:
from sklearn.model_selection import train_test_split

# train_test_split takes the arrays positionally, not as X=/y= keywords
X = ratings_full.drop(['nr_votes'], axis=1)
y = ratings_full['nr_votes']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42,
                                                    stratify=y)
Then I have to put the dataframes back together. It's not an ideal answer, but it may work. I will even try to bucket first and use the bucket column as my "labels", as in the sketch below.
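A minimal sketch of that bucketing idea, assuming pd.qcut and an arbitrary choice of 5 buckets (both assumptions, not from the original code):
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical bucketing: cut nr_votes into up to 5 quantile-based buckets
ratings_full['bucket'] = pd.qcut(ratings_full['nr_votes'], q=5,
                                 labels=False, duplicates='drop')

# Stratify the split on the bucket labels instead of the raw counts
rest_df, sample_df = train_test_split(ratings_full,
                                      test_size=0.33,
                                      random_state=42,
                                      stratify=ratings_full['bucket'])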
Upvotes: 2
Views: 1100
Reputation: 2139
from sklearn.model_selection import StratifiedShuffleSplit

# X holds the features and y the (bucketed) labels, as in the question
n_splits = 1
sss = StratifiedShuffleSplit(n_splits=n_splits,
                             test_size=0.1,
                             random_state=42)
train_idx, test_idx = next(sss.split(X, y))
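To turn those indices back into dataframes (assuming X and y are the feature frame and labels from the question's EDIT 2):
# split() yields positional indices, so use iloc
X_sample, y_sample = X.iloc[test_idx], y.iloc[test_idx]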
Upvotes: 0
Reputation: 323226
We can do this with np.random.choice, using the index as the population to draw from:
import numpy as np

n = len(ratings_full)
# size must be an int; p must sum to 1; replace=False keeps users distinct
idx = np.random.choice(ratings_full.index.values, p=ratings_full['probability'],
                       size=int(n * 0.0007), replace=False)
Then
sample_df = ratings_full.loc[idx].copy()
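The 'probability' column is assumed to already exist. One way to build it so that users with fewer votes are more likely to be drawn (the inverse-vote weighting is an assumption, not part of the original answer):
# Inverse-vote weights, normalized to sum to 1 as np.random.choice requires
weights = 1.0 / ratings_full['nr_votes'].clip(lower=1)
ratings_full['probability'] = weights / weights.sum()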
Upvotes: 1