Reputation: 8352
Say I have a dataset with two numerical columns and I would like to add a third column which is the product of the two (or some other function of the two existing columns). I can compute the new feature using a ColumnTransformer:
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
)
(X is a pandas DataFrame, hence the indexing via column names. Note also the reshape I had to use; maybe someone has a better idea there.)
As written above, I would like to keep the original features (similar to what sklearn.preprocessing.PolynomialFeatures does), i.e. use all three columns to fit a linear model (or, more generally, use them in an sklearn pipeline). How do I do this?
For example,
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
)
tfs.fit_transform(df)
gives
array([[1. ],
[1.73205081]])
but I would like to get an array that also includes the original columns, to pass this on in the pipeline.
The only way I could think of is a FeatureUnion with an identity transformation for the first two columns. Is there a more direct way?
(I would like to make a pipeline rather than change the DataFrame, so that I do not forget to apply the augmentation when calling model.predict().)
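For reference, the FeatureUnion idea with an identity transformation can be sketched as follows. A FunctionTransformer with no function acts as the identity; the step names "orig" and "sqrt_diff" are arbitrary labels I made up:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})

union = FeatureUnion([
    # func=None makes FunctionTransformer the identity, keeping both columns
    ("orig", FunctionTransformer()),
    ("sqrt_diff", FunctionTransformer(
        lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1))),
])
union.fit_transform(df)  # columns: colX, colY, sqrt(colX - colY)
```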
Upvotes: 0
Views: 439
Reputation: 8352
Reading the documentation more carefully, I found that it is possible to pass "special-cased strings" to "indicate to drop the columns or to pass them through untransformed, respectively."
So one possibility to achieve my goal is
tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
    ("passthrough", df.columns),
)
yielding
array([[1. , 3. , 2. ],
[1.73205081, 4. , 1. ]])
In the end there is thus no need for a FeatureUnion; it can be done with ColumnTransformer or make_column_transformer alone.
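To illustrate the original goal of building a pipeline so that the augmentation is not forgotten at prediction time, the transformer can be chained with a linear model. The target y below is made up purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'colX': [3, 4], 'colY': [2, 1]})
y = np.array([1.0, 2.0])  # made-up target, for illustration only

tfs = make_column_transformer(
    (FunctionTransformer(lambda X: np.sqrt(np.asarray(X)[:, 0] - np.asarray(X)[:, 1]).reshape(-1, 1)),
     ["colX", "colY"]),
    ("passthrough", df.columns),
)
model = make_pipeline(tfs, LinearRegression())
model.fit(df, y)
preds = model.predict(df)  # the feature augmentation is applied automatically
```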
Upvotes: 1