unbik

Reputation: 186

How to take a sample from a very unbalanced DataFrame without losing too many '1's?

I have a Pandas DataFrame with an ID column and a Target variable (for a machine learning model).

How can I sample the data so that I do not lose too many ones (Target = 1), which are very rare anyway? In the next step I will, of course, add the remaining variables and perform oversampling, but first I need to take a sample of the data.

How can I do that in Python?

Upvotes: 0

Views: 806

Answers (3)

Khaled DELLAL

Reputation: 921

Assuming you want a sample of 1000 rows, try the following line (note that replace=True samples with replacement, so the same row can appear more than once):

df.sample(frac=1000/len(df), replace=True, random_state=1)
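A uniform random sample keeps the class ratio only on average, so a small sample may contain very few ones. One possible alternative, sketched below on made-up data, is to keep every row with Target = 1 and fill the rest of the sample with randomly chosen zeros. The column names ID and Target come from the question; the data, the 1% positive rate and sample_size are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows with roughly 1% positives.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ID": np.arange(10_000),
    "Target": (rng.random(10_000) < 0.01).astype(int),
})

sample_size = 1000  # assumed total sample size

ones = df[df["Target"] == 1]
zeros = df[df["Target"] == 0]

# Keep every positive row, then fill the remainder with zeros
# drawn without replacement.
n_zeros = min(max(sample_size - len(ones), 0), len(zeros))
sample = pd.concat([ones, zeros.sample(n=n_zeros, random_state=1)])

# Shuffle so the ones are not all at the top.
sample = sample.sample(frac=1, random_state=1)

print(sample["Target"].value_counts())
```

The sample no longer has the original class ratio, which is usually acceptable if you plan to rebalance the classes afterwards anyway.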

Upvotes: 1

Mohammed Hadani

Reputation: 1

I think the solution is to combine oversampling and undersampling.

Random Oversampling: Randomly duplicate examples in the minority class.

Random Undersampling: Randomly delete examples in the majority class.

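from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# X (features) and y (target) are assumed to be defined already.
# Duplicate minority rows until minority / majority = 0.1
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)
# Drop majority rows until minority / majority = 0.5
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)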
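To make the two steps concrete, here is a rough sketch of the same idea in plain pandas, without imblearn. It is not what the library does internally, only an approximation of the effect of the two ratios. The data and the names minority, majority and resampled are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows with roughly 1% positives.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ID": np.arange(10_000),
    "Target": (rng.random(10_000) < 0.01).astype(int),
})

minority = df[df["Target"] == 1]
majority = df[df["Target"] == 0]

# Step 1, oversampling: keep all original ones and add random duplicates
# until minority / majority is about 0.1.
n_minority = int(0.1 * len(majority))
extra = minority.sample(n=n_minority - len(minority), replace=True, random_state=0)
minority_over = pd.concat([minority, extra])

# Step 2, undersampling: randomly keep only enough zeros so that
# minority / majority is about 0.5.
n_majority = int(len(minority_over) / 0.5)
majority_under = majority.sample(n=n_majority, random_state=0)

resampled = pd.concat([minority_over, majority_under])
print(resampled["Target"].value_counts())
```

This sketch assumes the starting ratio is below 0.1; otherwise the number of extra rows in step 1 would be negative and the call to sample would fail.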

Upvotes: 0

Nikolay Zakirov

Reputation: 1584

Perhaps this is what you need. The stratify parameter of train_test_split makes sure the data is sampled in a stratified fashion, so each sample keeps the class proportions of y.

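from sklearn.model_selection import train_test_split
import numpy as np

# Dummy data: 30,000 rows, 2 features, random binary target
X = np.random.rand(30000, 2)
y = np.random.randint(2, size=30000)
# train_test_split returns four arrays; each split has 100 rows
# with the same class proportions as y
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100, test_size=100, stratify=y, shuffle=True)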
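The same call also works directly on a DataFrame. The sketch below uses made-up, heavily unbalanced data (the 1% positive rate and the sample size of 1000 are assumptions) and passes the Target column as stratify, so the sample has nearly the same share of ones as the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10,000 rows with roughly 1% positives.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ID": np.arange(10_000),
    "Target": (rng.random(10_000) < 0.01).astype(int),
})

# With a single input, train_test_split returns two pieces;
# the first one is the stratified sample of 1000 rows.
sample, rest = train_test_split(
    df, train_size=1000, stratify=df["Target"], random_state=0
)

print("share of ones in full data:", df["Target"].mean())
print("share of ones in sample:  ", sample["Target"].mean())
```

Note that stratification preserves the ratio rather than increasing it, so a 1000-row sample of 1% data still holds only about 10 ones.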

Upvotes: 0
