Reputation: 8352
Say I have a dataset with two numerical columns and I would like to add a third column which is the product of the two (or some other function of the two existing columns). I can compute the new feature using a ColumnTransformer:
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
)
(X is a pandas DataFrame, hence the indexing via column names. Note also the reshape I had to use; maybe someone has a better idea there.)
As written above, I would like to keep the original features (similar to what sklearn.preprocessing.PolynomialFeatures does), i.e. use all three columns to fit a linear model (or, more generally, use them in an sklearn pipeline). How do I do this?
For example,
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
)
tfs.fit_transform(df)
gives
array([[1. ],
[1.73205081]])
but I would like to get an array that also includes the original columns, to pass this on in the pipeline.
The only way I could think of is a FeatureUnion with an identity transformation for the first two columns. Is there a more direct way?
(I would like to make a pipeline rather than change the DataFrame, so that I do not forget to apply the augmentation when calling model.predict().)
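For reference, the FeatureUnion idea with an identity transformation can be sketched as follows. A FunctionTransformer with no function acts as the identity; the step names "orig" and "sqrt_diff" are arbitrary labels I made up:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})

union = FeatureUnion([
    # func=None makes FunctionTransformer the identity, keeping both columns
    ("orig", FunctionTransformer()),
    ("sqrt_diff", FunctionTransformer(
        lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1))),
])
union.fit_transform(df)  # columns: colX, colY, sqrt(colX - colY)
```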
Upvotes: 0
Views: 439
Reputation: 8352
Reading the documentation more carefully, I found that it is possible to pass "special-cased strings" to "indicate to drop the columns or to pass them through untransformed, respectively."
So one possibility to achieve my goal is
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
    ("passthrough", df.columns),
)
yielding
array([[1. , 3. , 2. ],
[1.73205081, 4. , 1. ]])
In the end there is thus no need for a FeatureUnion; it can be done with ColumnTransformer or make_column_transformer alone.
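To illustrate the original goal of building a pipeline so that the augmentation is not forgotten at prediction time, the transformer can be chained with a linear model. The target y below is made up purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
y = np.array([1.0, 2.0])  # made-up target, for illustration only

tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
    ("passthrough", df.columns),
)
model = make_pipeline(tfs, LinearRegression())
model.fit(df, y)
preds = model.predict(df)  # the feature augmentation is applied automatically
```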
Upvotes: 1