
Reputation: 186

How to take a sample of data from a very unbalanced DataFrame so as not to lose too many '1's?

I have a pandas DataFrame like the one below, with an ID column and a TARGET variable (for a machine learning model).

My DataFrame is really large and unbalanced, so I need to take a sample of it.
The class balance looks like this:
- 99.60% - 0
- 0.40% - 1

ID	TARGET
111	1
222	1
333	0
444	1
...	...

How can I sample the data without losing too many ones (TARGET = 1), which are already very rare? In the next step I will of course add the remaining variables and perform oversampling, but first I need to take a sample of the data.

How can I do that in Python?

Upvotes: 0

Answers (3)

Khaled DELLAL

Reputation: 921

Assume you want a sample size of 1000.

Try the following line:

df.sample(frac=1000/len(df), replace=True, random_state=1)
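A uniform draw like the one above keeps the 0.4% share of ones only on average, so a 1000-row sample would contain about four of them. If the goal is to keep every rare row, one option is to sample only the zeros. The following is a minimal sketch of that idea, not a definitive recipe: the column name `TARGET` comes from the question, while the synthetic data, the helper name `sample_keep_positives`, and the choice of 9,600 negatives are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): 100,000 rows, 0.4% ones.
rng = np.random.default_rng(0)
target = np.zeros(100_000, dtype=int)
target[rng.choice(100_000, size=400, replace=False)] = 1
df = pd.DataFrame({"ID": np.arange(100_000), "TARGET": target})


def sample_keep_positives(df, n_negatives, target_col="TARGET", seed=1):
    """Hypothetical helper: keep all rows with target 1, sample the rest."""
    positives = df[df[target_col] == 1]
    # Sample the majority class without replacement, so no row is duplicated.
    negatives = df[df[target_col] == 0].sample(n=n_negatives, random_state=seed)
    # Concatenate, then shuffle so the ones are not grouped at the top.
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)


sample = sample_keep_positives(df, n_negatives=9_600)
print(sample["TARGET"].value_counts())
```

The sample no longer reflects the original class ratio, which matters if you later need calibrated probabilities; keeping track of the sampling rate for the zeros lets you correct for it.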

Upvotes: 1

Mohammed Hadani

Reputation: 1

I think the solution is to combine oversampling and undersampling.

Random oversampling: randomly duplicate examples in the minority class.

Random undersampling: randomly delete examples in the majority class.

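from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# X holds the feature columns and y the TARGET column of your DataFrame
# Oversample the minority class up to a 1:10 minority/majority ratio
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)
# Then undersample the majority class down to a 1:2 minority/majority ratio
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)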
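To see what these two ratios do to the class sizes, here is a small arithmetic sketch. It assumes the usual meaning of a float `sampling_strategy` for a binary target (the desired minority/majority ratio after resampling) and uses a hypothetical helper, `counts_after_resampling`, with row counts chosen to match the 99.60% / 0.40% split from the question. It does not call imblearn itself.

```python
def counts_after_resampling(n_majority, n_minority, over_ratio, under_ratio):
    """Hypothetical helper: class counts after over- then undersampling.

    Assumes a float ratio means minority / majority after each step.
    n_minority is only checked: oversampling can grow it, never shrink it.
    """
    # Oversampling grows the minority class to over_ratio * majority.
    n_minority_over = int(n_majority * over_ratio)
    if n_minority_over < n_minority:
        raise ValueError("over_ratio is below the current minority share")
    # Undersampling shrinks the majority class to minority / under_ratio.
    n_majority_under = int(n_minority_over / under_ratio)
    return n_majority_under, n_minority_over


# 100,000 rows with 0.40% ones: 99,600 zeros and 400 ones (assumed sizes).
n_majority, n_minority = counts_after_resampling(99_600, 400, 0.1, 0.5)
print(n_majority, n_minority)
```

With these numbers the ones grow from 400 to 9,960 rows through duplication, and the zeros shrink from 99,600 to 19,920, so the final data set has about 30,000 rows.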

Upvotes: 0

Nikolay Zakirov

Reputation: 1584

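Perhaps this is what you need. The stratify parameter makes sure the sample keeps the same class proportions as the full data.

from sklearn.model_selection import train_test_split
import numpy as np
# Toy data: 30,000 rows, 2 features, random binary target
X = np.random.rand(30000, 2)
y = np.random.randint(2, size=30000)
# train_test_split returns four arrays; each subset has 100 rows, stratified on y
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=100, test_size=100, stratify=y, shuffle=True)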
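The same stratified idea can be applied directly to a DataFrame with `groupby(...).sample(...)`, which is available in pandas 1.1 and later. This is a minimal sketch on synthetic data; the frame, the 10% fraction, and the seed are assumptions for illustration, and only the column names `ID` and `TARGET` come from the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption): 400 ones in 100,000 rows.
target = np.zeros(100_000, dtype=int)
target[:400] = 1
df = pd.DataFrame({"ID": np.arange(100_000), "TARGET": target})

# Draw 10% of the rows from each class separately, so the sample has
# the same 0.4% share of ones as the full frame.
stratified = df.groupby("TARGET").sample(frac=0.1, random_state=0)

print(stratified["TARGET"].value_counts())
```

Note that stratifying preserves the class ratio, so the ones are thinned out at the same rate as the zeros. If you want to lose as few ones as possible, sample the classes at different rates instead.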

Upvotes: 0
