Reputation: 538
So I currently have a Pipeline that has a lot of custom transformers:
p = Pipeline([
("GetTimeFromDate",TimeTransformer("Date")), #Custom Transformer that adds ["time"] column
("GetZipFromAddress",ZipTransformer("Address")), #Custom Transformer that adds ["zip"] column
("GroupByTimeandZip",GroupByTransformer(["time","zip"])) #Custom Transformer that adds onehot columns
])
Each transformer takes in a pandas dataframe and returns the same dataframe with one or more new columns. It actually works quite well, but how can I run the "GetTimeFromDate" and the "GetZipFromAddress" steps in parallel?
I would like to use FeatureUnion:
f = FeatureUnion([
("GetTimeFromDate",TimeTransformer("Date")), #Custom Transformer that adds ["time"] column
("GetZipFromAddress",ZipTransformer("Address")), #Custom Transformer that adds ["zip"] column
])
p = Pipeline([
("FeatureUnionStep",f),
("GroupByTimeandZip",GroupByTransformer(["time","zip"])) #Custom Transformer that adds onehot columns
])
The problem is that FeatureUnion returns a numpy.ndarray, while the "GroupByTimeandZip" step needs a dataframe.
Is there a way I can get FeatureUnion to return a pandas dataframe?
Upvotes: 3
Views: 2262
Reputation: 3283
In scikit-learn 1.5.2 (the option has been available since version 1.2) you can make transformers return a pandas DataFrame with
from sklearn import set_config
set_config(transform_output="pandas")
Currently, the allowed output types are
"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged
Please check set_config documentation for further information.
Upvotes: 0
Reputation: 8953
For a FeatureUnion to output a DataFrame, you can use the PandasFeatureUnion from this blog post. Also see the gist.
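In case the links are unavailable, the general idea can be sketched as follows. This is not the blog post's PandasFeatureUnion; it is a simplified, hypothetical DataFrameUnion class that runs each transformer on the input and joins the resulting DataFrames with pd.concat. The helper functions hour_of_day and zip_code are also made up for the example, and it assumes each transformer returns only its new columns, not the whole input frame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


class DataFrameUnion(BaseEstimator, TransformerMixin):
    """Hypothetical helper: apply each transformer to X and join results column-wise."""

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        for _, transformer in self.transformer_list:
            transformer.fit(X, y)
        return self

    def transform(self, X):
        # Each transformer is expected to return a DataFrame sharing X's index.
        frames = [t.transform(X) for _, t in self.transformer_list]
        return pd.concat(frames, axis=1)


def hour_of_day(df):
    # Illustrative stand-in for the question's TimeTransformer.
    return pd.DataFrame({"time": pd.to_datetime(df["Date"]).dt.hour}, index=df.index)


def zip_code(df):
    # Illustrative stand-in for ZipTransformer: takes the last five characters.
    return pd.DataFrame({"zip": df["Address"].str[-5:]}, index=df.index)


df = pd.DataFrame({
    "Date": ["2024-01-01 09:30", "2024-01-01 17:05"],
    "Address": ["1 Main St, New York, NY 10001", "2 Market St, San Francisco, CA 94105"],
})

union = DataFrameUnion([
    ("GetTimeFromDate", FunctionTransformer(hour_of_day)),
    ("GetZipFromAddress", FunctionTransformer(zip_code)),
])

out = union.fit_transform(df)
print(out)
```

Unlike FeatureUnion, this sketch runs the transformers sequentially (there is no n_jobs), so it only addresses the DataFrame output, not the parallelism.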
Upvotes: 2