Reputation: 55
For a regression problem, I have a training data set with:
- 3 variables with a Gaussian distribution
- 20 variables with a uniform distribution
All my variables are continuous, in [0, 1].
The problem is that the test data, used to score my regression model, has a uniform distribution for all the variables. As a result, I get bad results at the tails of the distributions, so I want to oversample my training set in order to duplicate the rarest rows.
So my idea is to bootstrap (sample with replacement) my training set in order to obtain a data set with the same distribution as the test set.
To do that, my idea (I don't know if it's a good one!) is to add 3 columns holding the interval of each of my 3 variables, and to use these columns to stratify the resampling.
Example: first, generate the data:
import numpy as np
from scipy.stats import truncnorm

def get_truncated_normal(mean=0.5, sd=0.15, min_value=0, max_value=1):
    # Normal distribution truncated to [min_value, max_value]
    return truncnorm(
        (min_value - mean) / sd, (max_value - mean) / sd, loc=mean, scale=sd)

generator = get_truncated_normal()

# Three Gaussian-like variables and one uniform variable
S1 = generator.rvs(1000)
S2 = generator.rvs(1000)
S3 = generator.rvs(1000)
u = np.random.uniform(0, 1, 1000)
Then check the distributions:
import seaborn as sns
sns.distplot(u);
sns.distplot(S2);
They look OK, so I'll add the category columns:
import pandas as pd

df = pd.DataFrame({'S1': S1, 'S2': S2, 'S3': S3, 'Unif': u})

BINS_NUMBER = 10

# Bin each Gaussian variable into 10 intervals for the stratification
for col in ['S1', 'S2', 'S3']:
    df[col + '_range'] = pd.cut(df[col],
                                bins=BINS_NUMBER,
                                precision=6,
                                right=True,
                                include_lowest=True)
A quick check:
df.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
That's good for me. So now I try to resample, but it's not working as intended:
from sklearn.utils import resample
df_resampled = resample(df, replace=True, n_samples=1000, stratify=df['S1_range'])
df_resampled.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
So it's not working: I get the same distribution in the output as in the input (presumably because stratify preserves the stratum proportions by design, which is exactly what I don't want here)...
Can you help me? Perhaps this isn't the right way to do it?
Thanks!
Upvotes: 2
Views: 1126
Reputation: 1942
Rather than writing code from scratch to resample your continuous data, you should take advantage of a library designed for resampling regression data.
Whereas the popular libraries (imbalanced-learn, etc.) focus on classification (categorical) targets, there is a recent Python library called resreg (REsampling for REGression) that lets you resample your continuous data (resreg GitHub page).
Also, rather than bootstrapping, you may want to generate synthetic data points at the tail ends of your normally distributed variables, as doing so will likely lead to much better results (see this paper). Similar to SMOTE for classification, which interpolates between existing samples, you can use SMOTER (SMOTE for regression) from the resreg package to generate synthetic values for regression/continuous data.
Here is an example of how you could use resreg to do the resampling in a few lines of code:
import numpy as np
import resreg

cl = np.percentile(y, 10)  # Oversample values below the 10th percentile
ch = np.percentile(y, 90)  # Oversample values above the 90th percentile

# Assign relevance scores indicating which samples in your dataset are
# to be resampled. Values below cl and above ch are assigned a relevance
# score above 0.5; all other values get a relevance score below 0.5.
relevance = resreg.sigmoid_relevance(X, y, cl=cl, ch=ch)

# Resample the relevant values (i.e. relevance >= 0.5) by interpolating
# between the k nearest neighbors (k=5). By setting over='balance', the
# relevant values are oversampled so that the numbers of relevant and
# irrelevant values are equal.
X_res, y_res = resreg.smoter(X, y, relevance=relevance,
                             relevance_threshold=0.5, k=5, over='balance',
                             random_state=0)
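As a quick sanity check (a minimal sketch, assuming y and y_res are the NumPy arrays from the snippet above), you can compare the bin counts of the target before and after resampling:

# Count samples per decile bin of the target's range; after SMOTER the
# tail bins (first and last) should be much better populated
bins = np.linspace(y.min(), y.max(), 11)
print(np.histogram(y, bins=bins)[0])      # original counts
print(np.histogram(y_res, bins=bins)[0])  # resampled counts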
Upvotes: 2
Reputation: 55
My solution:
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer


def create_sampled_data_set(n_samples_by_bin=1000,
                            n_bins=10,
                            replace=True,
                            save_csv=True):
    """In order to have the same distribution for S1..S3 between the
    training set and the test set, generate a new, resampled
    training set.
    Return: (X_train, y_train)
    """
    def stratified_sample_df_(df, col, n_samples, replace=True):
        # Draw n_samples rows from every stratum of `col`
        if replace:
            n = n_samples
        else:
            n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n, replace=replace))
        df_.index = df_.index.droplevel(0)
        return df_

    # load_data_for_train() is my own helper returning the raw
    # training DataFrames
    X_train, y_train = load_data_for_train()

    # Merge the dataframes for the sampling. Target will be removed after.
    X_train = pd.merge(
        X_train, y_train[['Target']], left_index=True, right_index=True)
    del y_train

    # Build a categorical feature from the S1..S3 distributions
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
    disc.fit(X_train[['S1', 'S2', 'S3']])
    y_bin = disc.transform(X_train[['S1', 'S2', 'S3']]).astype(int)
    del disc

    # One stratum per combination of the three bin indices, e.g. '2;5;7'
    y_concat = ['{};{};{}'.format(a, b, c) for a, b, c in y_bin]
    del y_bin

    X_train['S_Class'] = y_concat
    del y_concat

    X_train_resampled = stratified_sample_df_(
        X_train, 'S_Class', n_samples_by_bin, replace=replace)
    del X_train

    y_train_resampled = X_train_resampled[['Target']].copy()
    X_train_resampled = X_train_resampled.drop(['S_Class', 'Target'], axis=1)

    # Save to files for further usage
    if save_csv:
        X_train_resampled.to_csv(
            "./data/training_input_resampled.csv", sep=",")
        y_train_resampled.to_csv(
            "./data/training_output_resampled.csv", sep=",")

    return (X_train_resampled, y_train_resampled)
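For completeness, here is a hypothetical usage sketch (it assumes load_data_for_train() is available, as inside the function, and that pandas is imported as pd):

X_res, y_res = create_sampled_data_set(n_samples_by_bin=1000, n_bins=10)

# The binned distribution of S1 should now be much flatter than the
# bell shape shown in the question
print(pd.cut(X_res['S1'], bins=10).value_counts().sort_index())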
Upvotes: 0