Reputation: 105
I have a situation where I need to do some column-specific processing in a pipeline, but because transformers return numpy arrays rather than pandas dataframes, I don't have column names to do my feature engineering.
Here's a simple, reproducible example with a function called engineer_feature that I want to use to create new data. It needs to run during/after the pipeline because it depends on one column having been imputed, and I would like it to work during k-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
df = pd.DataFrame({
    "Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"],
    "A": [1, 2, 3, np.nan, 22, 4],
    "B": [5, 7, 12, 21, 12, 10],
})

def engineer_feature(df):
    df["C"] = df["A"] / df["B"]
    return df

categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])

preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])
preprocessor.fit_transform(df)
Which yields this error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Which makes sense, because engineer_feature is trying to index columns as though they are dataframes when they are just numpy arrays.
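You can see what the transformer actually receives with a quick check (reusing the imports and df from above):

# SimpleImputer, like most sklearn transformers, returns a bare ndarray,
# so the column labels are gone by the time engineer_feature runs
print(type(SimpleImputer().fit_transform(df[["A", "B"]])))  # <class 'numpy.ndarray'>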
What's a strategy for getting around this? I don't want to hardcode column indices to access them via numpy, especially since my real dataframe has many more columns.
Upvotes: 4
Views: 2941
Reputation: 25189
For your toy example to work, you need to:
def engineer_feature(X):
    # X is a numpy array here: index by position and append the ratio column
    return np.c_[X, X[:, 0] / X[:, 1]]
categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])

preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])

preprocessor.fit_transform(df)
FunctionTransformer() receives a plain numpy array, so you cannot avoid hardcoding the column indices here.
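That said, scikit-learn 1.2 and later add a set_output API that keeps pandas DataFrames flowing between pipeline steps, so a name-based engineer_feature can work unchanged. A minimal sketch, assuming scikit-learn >= 1.2 (the feature_names_out callable and sparse_output=False are there to support pandas output):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"],
    "A": [1, 2, 3, np.nan, 22, 4],
    "B": [5, 7, 12, 21, 12, 10],
})

def engineer_feature(df):
    # Receives a DataFrame because every upstream step now outputs pandas
    df["C"] = df["A"] / df["B"]
    return df

preprocessor = ColumnTransformer([
    # Dense output is required when the final container is a DataFrame
    ("categorical", OneHotEncoder(sparse_output=False), ["Type"]),
    ("numeric", Pipeline([
        ("imputer", SimpleImputer()),
        ("engineer", FunctionTransformer(
            engineer_feature,
            # Declare the engineered output column names for pandas output
            feature_names_out=lambda transformer, names: list(names) + ["C"],
        )),
        ("scaler", StandardScaler()),
    ]), ["A", "B"]),
]).set_output(transform="pandas")

preprocessor.fit_transform(df)  # a DataFrame with named columns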
Upvotes: 1
Reputation: 105
Thanks to the discussion and answers given by Nick and Sergey (specifically, that I do know which columns of my dataframe I'm passing into engineer_feature), I've come up with a solution that is acceptable to me; though if anyone has a better idea, please chime in.
import numpy as np
import pandas as pd
from functools import partial
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
df = pd.DataFrame({
    "Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"],
    "A": [1, 2, 3, np.nan, 22, 4],
    "B": [5, 7, 12, 21, 12, 10],
})

def engineer_feature(columns, X):
    # Rebuild a DataFrame from the array so columns can be accessed by name
    df = pd.DataFrame(X, columns=columns)
    df["C"] = df["A"] / df["B"]
    return df

categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])

def numeric_transformer(columns):
    # Bind the known column names to engineer_feature via functools.partial
    transformer = Pipeline([
        ("imputer", SimpleImputer()),
        ("engineer", FunctionTransformer(partial(engineer_feature, columns))),
        ("scaler", StandardScaler())
    ])
    return ("numeric", transformer, columns)

preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    numeric_transformer(["A", "B"])
])

preprocessor.fit_transform(df)
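Since one goal was for this to run during k-fold cross-validation, here is a sketch of how the preprocessor above might plug into cross_val_score. The classifier and the labels y are hypothetical stand-ins, and handle_unknown="ignore" is set because on this tiny dataset a category can appear only in a held-out fold:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Tolerate categories that are absent from a training fold
preprocessor.set_params(categorical__one_hot__handle_unknown="ignore")

model = Pipeline([
    ("preprocess", preprocessor),
    ("classify", RandomForestClassifier(random_state=0)),  # hypothetical estimator
])
y = [0, 1, 0, 1, 0, 1]  # hypothetical labels for the six example rows
print(cross_val_score(model, df, y, cv=3))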
It's worth noting this depends on both columns A and B having at least one non-missing value each, so that SimpleImputer does not drop the column.
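A quick illustration of that caveat (SimpleImputer's default mean strategy drops features with no observed values at all):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

all_nan = pd.DataFrame({"A": [np.nan, np.nan], "B": [1.0, 2.0]})
# The all-NaN column "A" is dropped, leaving a single column
print(SimpleImputer().fit_transform(all_nan).shape)  # (2, 1)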
Upvotes: 2
Reputation: 76
There are ways to get around your challenge by adding a few steps and simplifying the entire approach, instead of trying to run everything on a single input dataframe:
- One-hot encode the categorical columns using the get_dummies() function in pandas.
- To create df["C"], you can write a lambda function and apply it to all rows in the dataframe using the apply function in pandas.
- Use sklearn for imputing and scaling the numeric columns. The output of sklearn will be a numpy array; you should convert it back to a pandas dataframe that can be used further.
In order to follow the above approach, split your dataframe into two, one with the categorical columns and the other with the numeric ones. Once you are done with the data processing, use append in pandas to put them back together (a sketch follows below):
df_numeric.append(df_categorical)
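A minimal sketch of that flow; note that pd.concat(..., axis=1) is used here for the recombination step, since DataFrame.append stacks rows rather than columns (and was removed in pandas 2.0):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"],
    "A": [1, 2, 3, np.nan, 22, 4],
    "B": [5, 7, 12, 21, 12, 10],
})

# One-hot encode the categorical half in pandas
df_categorical = pd.get_dummies(df[["Type"]])

# Engineer the new column on the numeric half with apply + a lambda
df_numeric = df[["A", "B"]].copy()
df_numeric["C"] = df_numeric.apply(lambda row: row["A"] / row["B"], axis=1)

# Recombine the two halves column-wise
df_processed = pd.concat([df_numeric, df_categorical], axis=1)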
You will need to save the output of each step in a new dataframe, and pass it further downstream in your data pipeline.
To reduce the memory footprint, delete the old dataframe and call the garbage collector:
import gc
del df
gc.collect()
You do not need to save the column index of a numpy array. Simply use df.columns to return the dataframe's columns as a list. For example, below is what you can do to convert the output of a sklearn transformation back into a dataframe:
sim = SimpleImputer()
sklearn_output_array = sim.fit_transform(df_input)

# Restore the column names lost in the numpy conversion
df_output = pd.DataFrame(sklearn_output_array, columns=df_input.columns)

# Free the intermediates that are no longer needed
del df_input
del sklearn_output_array
gc.collect()

df_output["C"] = df_output["A"] / df_output["B"]
I agree that the above approach will increase the number of lines of code. However, the code will be much more readable and easier to follow.
In addition to the above, below is another Stack Overflow post that deals with one-hot encoding and saving the column names of transformed dataframes for further use downstream. The answer has some code examples that you might find useful.
https://stackoverflow.com/a/60107683/12855052
Hope this all helps, and let me know if you have further questions!
Upvotes: 1