fuenfundachtzig

Reputation: 8352

FeatureUnion: keep existing features plus add new engineered features (aka transformed columns)

Say I have a dataset with 2 numerical columns and I would like to add a third column that is some function of the two, e.g. their product or, as in the code below, the square root of their difference. I can compute the new feature using a ColumnTransformer:

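tfs = make_column_transformer(
  # validate=True converts the selected DataFrame columns to a NumPy array, which the
  # positional indexing X[:,0] needs (recent scikit-learn versions default to validate=False)
  (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1), validate=True), ["colX", "colY"]),
)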

(X is a pandas DataFrame, hence the column selection by name. Note also the reshape I had to add so that the result is a 2-D column; maybe someone has a better idea there.)

As written above I would like to keep the original features (similar to what sklearn.preprocessing.PolynomialFeatures is doing), i.e. use all 3 columns to fit a linear model (or generally use them in an sklearn pipeline). How do I do this?

For example,

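import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
tfs = make_column_transformer(
  (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1), validate=True), ["colX", "colY"]),
)
tfs.fit_transform(df)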

gives

array([[1.        ],
      [1.73205081]])

but I would like to get an array that also includes the original columns, so that I can pass it on in the pipeline.

The only way I could think of is a FeatureUnion with an identity transformation for the first two columns. Is there a more direct way?
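For reference, the FeatureUnion variant I have in mind could be sketched roughly as follows. The helper name sqrt_diff is made up for illustration, and I am assuming that a FunctionTransformer without a function acts as the identity (func=None is its default).

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def sqrt_diff(X):
    """Hypothetical helper: sqrt(column 0 - column 1) as a single 2-D column."""
    X = np.asarray(X, dtype=float)  # accept a DataFrame or an array
    # Indexing with lists keeps the result 2-D, so no reshape is needed.
    return np.sqrt(X[:, [0]] - X[:, [1]])


df = pd.DataFrame({"colX": [3, 4], "colY": [2, 1]})

union = FeatureUnion([
    ("original", FunctionTransformer()),          # func=None: identity
    ("derived", FunctionTransformer(sqrt_diff)),  # the engineered feature
])

# Shape (2, 3): colX, colY, then the derived column.
augmented = union.fit_transform(df)
print(augmented)
```

This works on the whole input, so it would need an extra column-selection step if the DataFrame had more columns than the two used here.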

(I would like to build a pipeline rather than change the DataFrame, so that I do not forget to apply the augmentation when calling model.predict().)

Upvotes: 0

Views: 439

Answers (1)

fuenfundachtzig

Reputation: 8352

Reading the documentation more carefully, I found that in place of a transformer it is possible to pass "special-cased strings" 'drop' and 'passthrough' to "indicate to drop the columns or to pass them through untransformed, respectively."

So one way to achieve my goal is

tfs = make_column_transformer(
  (FunctionTransformer(lambda X: np.sqrt(X[:,0] - X[:,1]).reshape(-1, 1), validate=True), ["colX", "colY"]),
  ("passthrough", df.columns)
)

yielding

array([[1.        , 3.        , 2.        ],
       [1.73205081, 4.        , 1.        ]])
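As far as I can tell, the remainder="passthrough" argument is not a substitute here: the remainder only covers columns that are not named in any transformer, and both columns are already used by the FunctionTransformer. A small sketch to illustrate the difference (sqrt_diff is again a made-up helper name):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer


def sqrt_diff(X):
    """Hypothetical helper: sqrt(column 0 - column 1) as a single 2-D column."""
    X = np.asarray(X, dtype=float)
    return np.sqrt(X[:, [0]] - X[:, [1]])


df = pd.DataFrame({"colX": [3, 4], "colY": [2, 1]})
cols = ["colX", "colY"]

# remainder="passthrough" only forwards columns no transformer has claimed;
# colX and colY are both claimed, so nothing extra is passed through.
only_derived = make_column_transformer(
    (FunctionTransformer(sqrt_diff), cols),
    remainder="passthrough",
).fit_transform(df)

# Listing the columns again with "passthrough" keeps them in the output.
with_originals = make_column_transformer(
    (FunctionTransformer(sqrt_diff), cols),
    ("passthrough", cols),
).fit_transform(df)

print(only_derived.shape)    # (2, 1)
print(with_originals.shape)  # (2, 3)
```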

So in the end there is no need for FeatureUnion: ColumnTransformer (or make_column_transformer) alone is enough.
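To tie this back to the original goal, here is a minimal sketch of the column transformer inside a pipeline with a linear model, so that the augmentation is applied automatically in both fit() and predict(). The training data, the target values and the helper name sqrt_diff are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def sqrt_diff(X):
    """Hypothetical helper: sqrt(column 0 - column 1) as a single 2-D column."""
    X = np.asarray(X, dtype=float)
    return np.sqrt(X[:, [0]] - X[:, [1]])


# Made-up data; colX >= colY keeps the square root real.
df = pd.DataFrame({"colX": [3, 4, 6, 9], "colY": [2, 1, 2, 0]})
y = np.array([1.0, 2.0, 3.0, 4.0])

cols = ["colX", "colY"]
model = make_pipeline(
    make_column_transformer(
        (FunctionTransformer(sqrt_diff), cols),  # engineered feature
        ("passthrough", cols),                   # original features
    ),
    LinearRegression(),
)
model.fit(df, y)

# The regression sees three features: the derived one plus colX and colY.
n_features = model.named_steps["linearregression"].coef_.shape[0]
predictions = model.predict(df)  # augmentation happens inside the pipeline
print(n_features, predictions.shape)
```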

Upvotes: 1
