Context: I am using an imblearn Pipeline as follows:
    import numpy as np
    from imblearn.over_sampling import SMOTENC
    from imblearn.pipeline import Pipeline
    from sklearn import compose, preprocessing
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer

    # Synthetic Minority Over-sampling Technique for Nominal and Continuous features
    features_cat_mask = np.in1d(self.X_features, self.X_features_cat)
    self.imbalance_transformer = SMOTENC(categorical_features=features_cat_mask)
    # Add binary column indicators for categorical features
    self.column_transformer = compose.make_column_transformer(
        (preprocessing.OneHotEncoder(handle_unknown='ignore', sparse=False),
         self.X_features_cat),
        remainder='passthrough')
    # Impute NaN values
    simple_imputer = SimpleImputer(strategy='median')
    model = RandomForestClassifier(n_jobs=-1,
                                   criterion='entropy',
                                   class_weight='balanced_subsample')
    self.clf = Pipeline(steps=[("imbalance_transformer", self.imbalance_transformer),
                               ("column_transformer", self.column_transformer),
                               ("simple_imputer", simple_imputer),
                               ("classifier", model)])
Before I added imblearn SMOTENC, I passed sample_weight to the classifier like this:
    self.clf.fit(self.X_train,
                 self.y_train,
                 classifier__sample_weight=self.sample_weight)
Where self.sample_weight was defined based on a column in the original dataframe that produces X_train and y_train (column = 'sample_weight').
However, since I started using imblearn, the number of rows output by SMOTENC is no longer equal to the number of rows in the original dataframe that sample_weight comes from, so I get the following error: ValueError: sample_weight.shape == (1208,), expected (1830,)!
Question: What are some recommended techniques for passing sample_weight to the model when using an imblearn transformer that changes the number of rows in the dataframe passed to the RF model?
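One possible workaround, offered as a minimal sketch rather than an established recipe: carry the weights through the resampler as a temporary column, so that they come out with the same number of rows as the resampled data. This assumes X_train is a pandas DataFrame and that the sampler returns a DataFrame when given one. The names resample_with_weights, RepeatMinoritySampler and the "_sample_weight" column are hypothetical. RepeatMinoritySampler is only a stand-in for SMOTENC so that the example runs without imblearn.

```python
import numpy as np
import pandas as pd


def resample_with_weights(sampler, X, y, sample_weight, weight_col="_sample_weight"):
    """Resample (X, y) and return weights aligned with the resampled rows.

    `sampler` is any object with an imblearn-style fit_resample(X, y) method.
    `weight_col` is a hypothetical temporary column name; it must not collide
    with a real feature name.
    """
    X_aug = X.copy()
    X_aug[weight_col] = np.asarray(sample_weight, dtype=float)
    X_res, y_res = sampler.fit_resample(X_aug, y)
    X_res = X_res.copy()
    # Split the weights back off so they are not used as a model feature.
    w_res = X_res.pop(weight_col).to_numpy()
    return X_res, y_res, w_res


class RepeatMinoritySampler:
    """Hypothetical stand-in for SMOTENC (assumption, not part of imblearn).

    Balances the classes by repeating minority rows instead of synthesising
    new ones, which is enough to show the change in row count.
    """

    def fit_resample(self, X, y):
        y = pd.Series(np.asarray(y), index=X.index)
        counts = y.value_counts()
        extra = []
        for label, n in counts.items():
            n_missing = counts.max() - n
            if n_missing > 0:
                idx = y.index[y == label].to_numpy()
                extra.extend(np.resize(idx, n_missing))
        order = list(X.index) + extra
        return X.loc[order].reset_index(drop=True), y.loc[order].reset_index(drop=True)


# Toy data: four rows of class 0, one row of class 1.
X_train = pd.DataFrame({"colour": ["r", "g", "r", "b", "g"],
                        "size": [1.0, 2.0, 3.0, 4.0, 5.0]})
y_train = pd.Series([0, 0, 0, 0, 1])
sample_weight = np.array([0.5, 1.0, 1.5, 2.0, 3.0])

X_res, y_res, w_res = resample_with_weights(
    RepeatMinoritySampler(), X_train, y_train, sample_weight)
```

With the real sampler, the categorical_features mask would need one extra False entry for the weight column. The "imbalance_transformer" step would then be dropped from the Pipeline, and the remaining steps fitted with clf.fit(X_res, y_res, classifier__sample_weight=w_res). Two caveats apply to this sketch. SMOTENC would treat the weight column as a continuous feature, so synthetic rows receive interpolated weights and the column also influences the nearest-neighbour search. And because resampling now happens outside the pipeline, it has to be applied to the training folds only when cross-validating.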