Jack

Reputation: 538

How to make FeatureUnion return a DataFrame

So I currently have a Pipeline that has a lot of custom transformers:

from sklearn.pipeline import Pipeline

p = Pipeline([
    ("GetTimeFromDate", TimeTransformer("Date")),  # Custom Transformer that adds ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # Custom Transformer that adds ["zip"] column
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # Custom Transformer that adds onehot columns
])

Each transformer takes in a pandas dataframe and returns the same dataframe with one or more new columns. It actually works quite well, but how can I run the "GetTimeFromDate" and the "GetZipFromAddress" steps in parallel?

I would like to use FeatureUnion:

from sklearn.pipeline import FeatureUnion, Pipeline

f = FeatureUnion([
    ("GetTimeFromDate", TimeTransformer("Date")),  # Custom Transformer that adds ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # Custom Transformer that adds ["zip"] column
])

p = Pipeline([
    ("FeatureUnionStep", f),
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # Custom Transformer that adds onehot columns
])

The problem is that FeatureUnion returns a numpy.ndarray, while the "GroupByTimeandZip" step needs a DataFrame.

Is there a way I can get FeatureUnion to return a pandas dataframe?

Upvotes: 3

Views: 2262

Answers (2)

Galuoises

Reputation: 3283

In scikit-learn 1.5.2 (the option has been available since version 1.2) you can make transformers output DataFrames with

from sklearn import set_config
set_config(transform_output="pandas")

Currently, the allowed output types are

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Please check the set_config documentation for further information.

Upvotes: 0

Stuart Hallows

Reputation: 8953

For a FeatureUnion to output a DataFrame you can use the PandasFeatureUnion from this blog post. Also see the gist.
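
If you would rather not depend on external code, the same idea can be sketched in a few lines. To be clear, this is not the PandasFeatureUnion from the blog post, just a simplified illustration of the approach: the DataFrameUnion class, the two helper functions and the sample data are all made up for this example, and it runs the branches sequentially rather than in parallel.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import FunctionTransformer


class DataFrameUnion(TransformerMixin, BaseEstimator):
    """Hypothetical helper: apply each transformer to X and concatenate
    the resulting DataFrames column-wise."""

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        # Fit a fresh copy of every transformer on the same input.
        self.fitted_ = [(name, clone(t).fit(X, y)) for name, t in self.transformer_list]
        return self

    def transform(self, X):
        # Every branch is expected to return a DataFrame sharing X's index.
        frames = [t.transform(X) for _, t in self.fitted_]
        return pd.concat(frames, axis=1)


def get_time(df):
    # Illustrative stand-in for the question's TimeTransformer.
    return pd.DataFrame({"time": pd.to_datetime(df["Date"]).dt.hour}, index=df.index)


def get_zip(df):
    # Illustrative stand-in for the question's ZipTransformer;
    # assumes the address ends with a 5-digit zip code.
    zips = df["Address"].str.extract(r"(\d{5})$", expand=False)
    return pd.DataFrame({"zip": zips}, index=df.index)


union = DataFrameUnion([
    ("GetTimeFromDate", FunctionTransformer(get_time)),
    ("GetZipFromAddress", FunctionTransformer(get_zip)),
])

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "Date": ["2024-01-01 08:30:00", "2024-01-02 17:45:00"],
    "Address": ["1 Main St, Springfield 12345", "2 Oak Ave, Shelbyville 67890"],
})

out = union.fit_transform(df)
print(out)
```

Since the result is a plain DataFrame, a later Pipeline step that needs the "time" and "zip" columns by name can pick them up directly.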

Upvotes: 2
