Mark

Reputation: 49

Passing `sample_weight` parameter to classifier in imblearn pipeline when using over/under sampling transformer

Context: I am using an imblearn `Pipeline` as follows:

        # Synthetic Minority Over-sampling Technique for Nominal and Continuous features
        features_cat_mask = np.isin(self.X_features, self.X_features_cat)
        self.imbalance_transformer = SMOTENC(categorical_features=features_cat_mask)

        # Add binary column indicators for categorical features
        self.column_transformer = compose.make_column_transformer(
            (preprocessing.OneHotEncoder(handle_unknown='ignore',
                                         sparse=False), self.X_features_cat),
            remainder='passthrough')

        # Impute NaN values
        simple_imputer = SimpleImputer(strategy='median')

        model = RandomForestClassifier(n_jobs=-1,
                                       criterion='entropy',
                                       class_weight='balanced_subsample')

        self.clf = Pipeline(steps=[("imbalance_transformer", self.imbalance_transformer),
                       ("column_transformer", self.column_transformer),
                       ("simple_imputer", simple_imputer),
                       ("classifier", model)])

Before adding imblearn's SMOTENC, I passed `sample_weight` to the classifier like this:

        self.clf.fit(self.X_train,
                     self.y_train,
                     classifier__sample_weight=self.sample_weight)

Here, self.sample_weight comes from the 'sample_weight' column of the original dataframe that also produces X_train and y_train.

However, now that SMOTENC is in the pipeline, the number of rows that reaches the classifier is no longer equal to the number of rows in the original dataframe that sample_weight comes from, and I get the following error: ValueError: sample_weight.shape == (1208,), expected (1830,)!
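To make the mismatch concrete, the two row counts in the error can be broken down with a little arithmetic. This is only a sketch: it assumes a two-class problem and that SMOTENC runs with its default strategy, which over-samples the minority class until it matches the majority class. The variable names are illustrative and are not taken from the pipeline above.

```python
# Row counts taken from the error message.
n_original = 1208   # rows in X_train, y_train and sample_weight
n_resampled = 1830  # rows the classifier receives after SMOTENC

# Assumption: two classes, minority over-sampled up to the majority size,
# so the resampled set is exactly twice the majority class.
n_majority = n_resampled // 2
n_minority = n_original - n_majority
n_synthetic = n_resampled - n_original

print(n_majority, n_minority, n_synthetic)  # 915 293 622
```

Under those assumptions, 622 of the rows reaching the classifier are synthetic and have no entry in the original sample_weight array.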

Question: What are some recommended techniques for passing sample_weight to the model when the pipeline contains an imblearn sampler that changes the number of rows passed to the RF model?
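One possible direction, sketched here under stated assumptions and not as an established imblearn feature, is to carry the weights inside X as an extra trailing numeric column, so that any resampling step changes features, labels and weights together, and then split that column off again right before the classifier is fitted. `WeightColumnClassifier` is a hypothetical wrapper written for this sketch, and the row duplication below is only a stand-in for a real sampler so that the example depends on nothing beyond NumPy and scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.ensemble import RandomForestClassifier


class WeightColumnClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: treats the last column of X as sample_weight."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        X = np.asarray(X)
        # Split the trailing weight column off the real features.
        features, weights = X[:, :-1], X[:, -1]
        self.estimator_ = clone(self.estimator).fit(
            features, y, sample_weight=weights
        )
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # The last column must still be present at predict time; it is ignored.
        return self.estimator_.predict(np.asarray(X)[:, :-1])

    def predict_proba(self, X):
        return self.estimator_.predict_proba(np.asarray(X)[:, :-1])


# Toy data: 3 features plus one weight column appended at the end.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.array([0] * 15 + [1] * 5)
weights = rng.uniform(0.5, 2.0, size=20)
X_with_w = np.column_stack([X, weights])

# Stand-in for a sampler: duplicate minority rows so the row count changes.
extra = rng.choice(np.flatnonzero(y == 1), size=10)
X_res = np.vstack([X_with_w, X_with_w[extra]])
y_res = np.concatenate([y, y[extra]])

clf = WeightColumnClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
)
clf.fit(X_res, y_res)
print(X_res.shape, clf.estimator_.n_features_in_)
```

In the pipeline above, the wrapper would take the place of the plain RandomForestClassifier in the "classifier" step, with no `classifier__sample_weight` argument passed to `fit`. Some caveats apply. The weight column has to stay last after the column transformer and imputer; with `remainder='passthrough'` the non-encoded columns are appended after the one-hot columns, so this should hold if the weight column is the last non-categorical column of X. SMOTENC would treat the weight as an ordinary continuous feature, so synthetic rows would receive interpolated weights and the weight would also influence the nearest-neighbour search. Finally, data passed to `predict` needs a placeholder last column, for example all ones.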

Upvotes: 0

Views: 355

Answers (0)
