Reputation: 55
I'm using SMOTE to resample a binary target, TARGET_FRAUD,
which takes the values 0 and 1. Class 0 has around 900 records, while class 1 has only about 100. I want to oversample class 1 to around 800 records.
This is in preparation for some classification modeling.
# Fix imbalanced data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
# Bar plot of the TARGET_FRAUD distribution
sns.countplot(x='TARGET_FRAUD', data=df)
plt.title('Before Resampling')
plt.show()
#Synthetic Minority Over-Sampling Technique
sm = SMOTE()
# Fit the model to generate the data.
oversampled_trainX, oversampled_trainY = sm.fit_resample(df.drop('TARGET_FRAUD', axis=1), df['TARGET_FRAUD'])
resampled_df = pd.concat([pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)], axis=1)
resampled_df.columns = df.columns
sns.countplot(x='TARGET_FRAUD', data=resampled_df)
plt.title('After Resampling')
plt.show()
This is the count of values before resampling:
TARGET_FRAUD:
0 898
1 102
This is the count of values after resampling:
1.000000 1251
0.000000 439
0.188377 1
0.228350 1
0.577813 1
0.989742 1
0.316744 1
0.791926 1
0.970161 1
0.757886 1
0.089506 1
0.567179 1
0.331502 1
0.563530 1
0.882599 1
0.918105 1
0.613229 1
0.239910 1
0.487373 1
...
Why is it producing random float values between 0 and 1? I only want it to return int values of 0 and 1.
Upvotes: 0
Views: 766
Reputation: 11
I do not have your dataset, so I built a reproducible example based on your code. With it, I cannot reproduce the behavior you describe.
from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
X, y = make_classification(random_state=0, weights=[0.9, 0.1])
df = pd.DataFrame(X)
df["TARGET_FRAUD"] = y
print("Before resampling")
print(Counter(df["TARGET_FRAUD"]))
sm = SMOTE()
# Fit the model to generate the data.
oversampled_trainX, oversampled_trainY = sm.fit_resample(
df.drop("TARGET_FRAUD", axis=1), df["TARGET_FRAUD"]
)
resampled_df = pd.concat(
[pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)],
axis=1,
)
print("After resampling")
print(Counter(resampled_df["TARGET_FRAUD"]))
which prints
Before resampling
Counter({0: 90, 1: 10})
After resampling
Counter({0: 90, 1: 90})
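One difference worth noting: my example leaves out the question's `resampled_df.columns = df.columns` line. A possible explanation for the float values, assuming TARGET_FRAUD is not the first column of your original df (the question does not show the column order), is that this line relabels the columns by position after the target was moved to the front, so the name TARGET_FRAUD ends up on a feature column. The sketch below does not run SMOTE at all; X_res and y_res are hand-made stand-ins for the resampler output, and AMOUNT is a hypothetical feature name.

```python
import pandas as pd

# Toy frame in which the target is the LAST column. This ordering is an
# assumption; the question does not show the real column order.
df = pd.DataFrame({
    "AMOUNT": [0.19, 0.58, 0.99, 0.32],  # hypothetical feature scaled to [0, 1]
    "TARGET_FRAUD": [0, 0, 1, 1],
})

# Hand-made stand-ins for what a resampler would return (no SMOTE here).
X_res = df.drop("TARGET_FRAUD", axis=1)
y_res = df["TARGET_FRAUD"]

# Same pattern as the question: target first, features second, then relabel.
mislabeled = pd.concat([pd.DataFrame(y_res), pd.DataFrame(X_res)], axis=1)
mislabeled.columns = df.columns  # positional relabel: names no longer match the data

# Alternative: keep the names that the pieces already carry.
fixed = pd.concat([X_res, y_res], axis=1)

print(mislabeled["TARGET_FRAUD"].tolist())  # feature values under the target's name
print(fixed["TARGET_FRAUD"].tolist())       # the actual 0/1 labels
```

If this is what happens in your case, dropping the column reassignment (or concatenating the features first and the target last) should leave only 0 and 1 in TARGET_FRAUD. Separately, SMOTE() balances the classes by default; to aim for about 800 minority samples, the sampling_strategy parameter accepts a dict such as {1: 800}.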
Upvotes: 1