Mandallaz

Reputation: 55

Resampling (bootstrap) a data set of continuous data for a regression problem

For a regression problem, I have a training data set with:

- 3 variables with a Gaussian distribution
- 20 variables with a uniform distribution

All my variables are continuous, in [0, 1].

The problem is that the test data used to score my regression model has a uniform distribution for all the variables. I get bad results at the tails of the distributions, so I want to oversample my training set in order to duplicate the rarest rows.

So my idea is to bootstrap (sampling with replacement) my training set in order to obtain a data set with the same distribution as the test set.

In order to do that, my idea (I don't know if it's a good one!) is to add 3 columns with intervals for my 3 variables and use these columns to stratify the resampling.

Example: first, generate the data

from scipy.stats import truncnorm
def get_truncated_normal(mean=0.5, sd=0.15, min_value=0, max_value=1):
    return truncnorm(
        (min_value - mean) / sd, (max_value - mean) / sd, loc=mean, scale=sd)

generator = get_truncated_normal()


import numpy as np
S1 = generator.rvs(1000)
S2 = generator.rvs(1000)
S3 = generator.rvs(1000)
u = np.random.uniform(0, 1, 1000)

Then check the distributions:

import seaborn as sns
sns.distplot(u);
sns.distplot(S2);
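
(Note: in seaborn 0.11+ distplot is deprecated; if it is unavailable, histplot with kde=True gives the same check.)

sns.histplot(u, kde=True)
sns.histplot(S2, kde=True)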

It's OK, so I'll add category columns

import pandas as pd
df = pd.DataFrame({'S1':S1,'S2':S2,'S3':S3,'Unif':u})

BINS_NUMBER = 10
for col in ['S1', 'S2', 'S3']:
    df[col + '_range'] = pd.cut(df[col],
                                bins=BINS_NUMBER,
                                precision=6,
                                right=True,
                                include_lowest=True)

A check:

df.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709]      3
(0.116709, 0.210454]                 15
(0.210454, 0.304199]                 64
(0.304199, 0.397944]                152
(0.397944, 0.491689]                254
(0.491689, 0.585434]                217
(0.585434, 0.679179]                173
(0.679179, 0.772924]                 86
(0.772924, 0.866669]                 30
(0.866669, 0.960414]                  6
dtype: int64

That's good for me. So now I'll try to resample, but it's not working as intended:

from sklearn.utils import resample
df_resampled = resample(df, replace=True, n_samples=1000,
                        stratify=df['S1_range'])
df_resampled.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709]      3
(0.116709, 0.210454]                 15
(0.210454, 0.304199]                 64
(0.304199, 0.397944]                152
(0.397944, 0.491689]                254
(0.491689, 0.585434]                217
(0.585434, 0.679179]                173
(0.679179, 0.772924]                 86
(0.772924, 0.866669]                 30
(0.866669, 0.960414]                  6
dtype: int64

So it's not working; I get the same distribution in the output as in the input...

Can you help me? Perhaps it's not the right way to do this?

Thanks!

Upvotes: 2

Views: 1126

Answers (2)

ProteinGuy

Reputation: 1942

Rather than writing code from scratch to resample your continuous data, you should take advantage of a library for resampling regression data.

Whereas the popular libraries (imbalanced-learn, etc.) focus on classification (categorical) variables, there is a recent Python library called resreg (RESampling for REGression) that allows you to resample your continuous data (see the resreg GitHub page).

Also, rather than bootstrapping, you may want to generate synthetic data points at the tail ends of your normally distributed variables, as doing this will likely lead to much better results (see this paper). Similar to SMOTE for classification, which interpolates between existing samples, you can use SMOTER (SMOTE for regression) from the resreg package to generate synthetic values for regression/continuous data.

Here is an example of how you would use resreg to achieve resampling with a few lines of code:


import numpy as np
import resreg

# X is the feature matrix and y the continuous target of the training set

cl = np.percentile(y, 10)  # oversample values below the 10th percentile
ch = np.percentile(y, 90)  # oversample values above the 90th percentile

# Assign relevance scores to indicate which samples in your dataset are
# to be resampled. Values below cl and above ch are assigned a relevance
# value above 0.5; other values are assigned a relevance value below 0.5.
relevance = resreg.sigmoid_relevance(X, y, cl=cl, ch=ch)

# Resample the relevant values (i.e. relevance >= 0.5) by interpolating
# between nearest k-neighbors (k=5). By setting over='balance', the
# relevant values are oversampled so that the numbers of relevant and
# irrelevant values are equal.
X_res, y_res = resreg.smoter(X, y, relevance=relevance,
                             relevance_threshold=0.5, k=5,
                             over='balance', random_state=0)
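
The resampled set can then be used in place of the original training data with any regressor; a minimal sketch, assuming a scikit-learn model (the model choice here is illustrative, not part of resreg):

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=0)
model.fit(X_res, y_res)  # train on the resampled data instead of X, y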


Upvotes: 2

Mandallaz

Reputation: 55

My solution: stratify in sklearn's resample preserves the input class proportions by design, which is why the distribution didn't change above. Instead, I draw a fixed number of rows per bin with groupby + sample:

import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer


def create_sampled_data_set(n_samples_by_bin=1000,
                            n_bins=10,
                            replace=True,
                            save_csv=True):
    """In order to have the same distribution for S1..S3 between training
    set and test set, this function will generate a new
    training set resampled

    Return: (X_train, y_train)
    """
    def stratified_sample_df_(df, col, n_samples, replace=True):
        if replace:
            n = n_samples
        else:
            n = min(n_samples, df[col].value_counts().min())

        df_ = df.groupby(col).apply(lambda x: x.sample(n, replace=replace))
        df_.index = df_.index.droplevel(0)
        return df_

    X_train, y_train = load_data_for_train()

    # merge the dataframe for the sampling. Target will be removed after
    X_train = pd.merge(
        X_train, y_train[['Target']], left_index=True, right_index=True)
    del y_train

    # build a categorical feature, from S1..S3 distribution
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
    disc.fit(X_train[['S1', 'S2', 'S3']])
    y_bin = disc.transform(X_train[['S1', 'S2', 'S3']])
    del disc
    y_bin = y_bin.astype(int)  # avoid the deprecated np.int

    # concatenate the three bin indices into a single stratification key
    y_concat = ['{};{};{}'.format(a, b, c) for a, b, c in y_bin]
    del y_bin

    X_train['S_Class'] = y_concat
    del y_concat

    X_train_resampled = stratified_sample_df_(
        X_train, 'S_Class', n_samples_by_bin)
    del X_train
    y_train_resampled = X_train_resampled[['Target']].copy()
    y_train_resampled.rename(
        columns={y_train_resampled.columns[0]: 'Target'}, inplace=True)

    X_train_resampled = X_train_resampled.drop(['S_Class', 'Target'], axis=1)

    # save in file for further usage
    if save_csv:
        X_train_resampled.to_csv(
            "./data/training_input_resampled.csv", sep=",")
        y_train_resampled.to_csv(
            "./data/training_output_resampled.csv", sep=",")

    return(X_train_resampled,
           y_train_resampled)
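
A minimal usage sketch (assuming load_data_for_train() is defined and returns X_train / y_train indexed alike; the parameter values are illustrative):

X_res, y_res = create_sampled_data_set(n_samples_by_bin=500,
                                       n_bins=10,
                                       save_csv=False)
print(X_res.shape, y_res.shape)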

Upvotes: 0
