Johannes Wiesner

Reputation: 1307

Imbalanced-Learn's FunctionSampler throws ValueError

I want to use the class FunctionSampler from imblearn to create my own custom class for resampling my dataset.

I have a one-dimensional feature Series containing paths for each subject and a label Series containing the labels for each subject. Both come from a pd.DataFrame. I know that I have to reshape the feature array first since it is one-dimensional.

When I use the class RandomUnderSampler directly, everything works fine. However, if I pass the features and labels to the fit_resample method of a FunctionSampler whose function creates an instance of RandomUnderSampler and calls fit_resample on it, I get the following error:

ValueError: could not convert string to float: 'path_1'

Here's a minimal example producing the error:

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from imblearn import FunctionSampler

# create one dimensional feature and label arrays X and y
# X has to be converted to numpy array and then reshaped. 
X = pd.Series(['path_1','path_2','path_3'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

FIRST METHOD (works)

rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X,y)

SECOND METHOD (doesn't work)

def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)

sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)

Does anyone know what goes wrong here? It seems as though the fit_resample method of FunctionSampler does not behave like the fit_resample method of RandomUnderSampler...

Upvotes: 1

Views: 1060

Answers (1)

Venkatachalam

Reputation: 16966

Your implementation of FunctionSampler is correct. The problem is with your dataset.

RandomUnderSampler seems to work for text data as well, because it does not validate its input with check_X_y.

But FunctionSampler() does run this check. You can reproduce it directly with check_X_y:

import pandas as pd
from sklearn.utils import check_X_y

X = pd.Series(['path_1','path_2','path_2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

check_X_y(X, y)

This throws an error:

ValueError: could not convert string to float: 'path_1'
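To see where that message comes from, here is a minimal sketch in plain NumPy. It assumes check_X_y is used with its default numeric dtype handling, under which object arrays are converted to float. The helper name try_float is hypothetical and only there for illustration.

```python
import numpy as np

# Object arrays of strings, shaped like the reshaped X from the question
paths = np.array([["path_1"], ["path_2"]], dtype=object)
digits = np.array([["1"], ["2"]], dtype=object)


def try_float(arr):
    """Hypothetical helper: return (converted, None) on success,
    or (None, error_message) if the float conversion fails."""
    try:
        return arr.astype(np.float64), None
    except ValueError as exc:
        return None, str(exc)


converted, error = try_float(digits)
print(converted)  # numeric strings convert cleanly to floats

converted, error = try_float(paths)
print(error)  # arbitrary strings such as 'path_1' cannot be converted
```

Strings that look like numbers survive the conversion, while file paths do not, which is the same distinction the validation step makes.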

The following example works, because the strings can be converted to floats:

X = pd.Series(['1','2','2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)

sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)

X_res, y_res 
# (array([[2.],
#        [1.]]), array([0, 1], dtype=int64))

Upvotes: 2
