Peter Tran

Reputation: 105

Column-specific processing in an sklearn pipeline

I have a situation where I need to do some column-specific processing in a pipeline, but because transformers return numpy arrays rather than pandas dataframes, I don't have column names to do my feature engineering.

Here's a simple, reproducible example with a function called engineer_feature that I want to use to create new data. It needs to run during/after the pipeline because it depends on one column being imputed first, and I would like it to be performed during k-fold cross-validation.

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"],
    "A": [1, 2, 3, np.nan, 22, 4],
    "B": [5, 7, 12, 21, 12, 10]
})

def engineer_feature(df):
    df["C"] = df["A"] / df["B"]
    return df

categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])

preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])

preprocessor.fit_transform(df)

Which yields this error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Which makes sense, because engineer_feature is trying to index columns as though they were dataframes when they are just numpy arrays.
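
For illustration, the column labels are already gone by the time the data reaches engineer_feature (a quick check, reusing the imports and df defined above):

X = SimpleImputer().fit_transform(df[["A", "B"]])
print(type(X))  # <class 'numpy.ndarray'> -- indexing X["A"] raises exactly this IndexError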

What's a strategy for getting around this? I don't want to hardcode column indices to access them via numpy, especially since my real dataframe has many more columns.

Upvotes: 4

Views: 2941

Answers (3)

Sergey Bushmanov

Reputation: 25189

For your toy example to work, you need to rewrite engineer_feature to operate on a numpy array:

def engineer_feature(X):
    # X is a numpy array here: column 0 is "A", column 1 is "B";
    # append their ratio as a new column
    return np.c_[X, X[:, 0] / X[:, 1]]

categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer()),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])

preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    ("numeric", numeric_transformer, ["A", "B"])
])

preprocessor.fit_transform(df)

FunctionTransformer() receives a numpy array, so you cannot avoid hardcoding the column indices here.
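
One caveat to this: newer scikit-learn releases (1.2+) added the set_output API, which can keep the data as a pandas DataFrame between pipeline steps, so the original name-based engineer_feature works unchanged. A minimal sketch, assuming scikit-learn >= 1.2 and reusing the imports and df from the question:

def engineer_feature(df):
    # df arrives as a DataFrame because the imputer is configured for pandas output
    df["C"] = df["A"] / df["B"]
    return df

numeric_transformer = Pipeline([
    # set_output makes SimpleImputer emit a DataFrame with the original column names
    ("imputer", SimpleImputer().set_output(transform="pandas")),
    ("engineer", FunctionTransformer(engineer_feature)),
    ("scaler", StandardScaler())
])

With this numeric_transformer, the preprocessor from the question runs as-is.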

Upvotes: 1

Peter Tran

Reputation: 105

Thanks to the discussion and answers given by Nick and Sergey (specifically the point that I do know which columns of my dataframe I'm passing into engineer_feature), I've come up with a solution that is acceptable to me; though if anyone has a better idea, please chime in.

import numpy as np
import pandas as pd

from functools import partial
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Type": ["Beta", "Beta", "Alpha", "Charlie", "Beta", "Charlie"],
    "A": [1, 2, 3, np.nan, 22, 4],
    "B": [5, 7, 12, 21, 12, 10]
})

def engineer_feature(columns, X):
    # Rebuild a dataframe from the numpy array, using the column names
    # bound in via functools.partial below
    df = pd.DataFrame(X, columns=columns)
    df["C"] = df["A"] / df["B"]
    return df

categorical_transformer = Pipeline([
    ("one_hot", OneHotEncoder())
])

def numeric_transformer(columns):
    transformer = Pipeline([
        ("imputer", SimpleImputer()),
        # partial binds the column names so engineer_feature can rebuild a dataframe
        ("engineer", FunctionTransformer(partial(engineer_feature, columns))),
        ("scaler", StandardScaler())
    ])

    # Return the (name, transformer, columns) triple that ColumnTransformer expects
    return ("numeric", transformer, columns)

preprocessor = ColumnTransformer([
    ("categorical", categorical_transformer, ["Type"]),
    numeric_transformer(["A", "B"])
])

preprocessor.fit_transform(df)

It's worth noting that this depends on columns A and B each having at least one non-missing value, so that SimpleImputer does not drop the column.
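
If that edge case is a concern, scikit-learn 1.2+ adds a keep_empty_features flag to SimpleImputer that keeps all-missing columns instead of dropping them:

imputer = SimpleImputer(keep_empty_features=True)  # requires scikit-learn >= 1.2

And since the original goal was k-fold cross-validation, the preprocessor drops into a full pipeline in the usual way. A sketch with a hypothetical target y and LogisticRegression (neither is part of the original example):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

y = [0, 1, 0, 1, 0, 1]  # hypothetical labels, for illustration only

# handle_unknown="ignore" matters on a dataframe this small: a validation
# fold can contain a category (e.g. "Alpha") unseen during that fold's fit
cv_preprocessor = ColumnTransformer([
    ("categorical", Pipeline([("one_hot", OneHotEncoder(handle_unknown="ignore"))]), ["Type"]),
    numeric_transformer(["A", "B"])
])

model = Pipeline([
    ("preprocess", cv_preprocessor),
    ("classify", LogisticRegression())
])

print(cross_val_score(model, df, y, cv=2))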

Upvotes: 2

Nick Kharas

Reputation: 76

There are ways to get around your challenge by adding a few steps and simplifying the entire approach, instead of trying to run everything on a single input dataframe.

  • For one-hot encoding, you can use the get_dummies() function in pandas.
  • For calculating df["C"], you can write a lambda function and apply it to all rows of the dataframe with the pandas apply function (both are sketched after this list).
  • You should still rely on sklearn for imputing and scaling the numeric columns.
  • As you correctly mentioned, the output from sklearn will be a numpy array. You should convert it back to a pandas dataframe that can be used further.
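
For example, the first two bullets might look like this (a minimal sketch reusing the df from the question):

# One-hot encode the categorical column with pandas instead of sklearn
df_categorical = pd.get_dummies(df[["Type"]])

# Compute C row by row with a lambda applied to each row
df_numeric = df[["A", "B"]].copy()
df_numeric["C"] = df_numeric.apply(lambda row: row["A"] / row["B"], axis=1)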

In order to follow the above approach,

  • Split your dataframe into two, one with the categorical columns and the other with the numeric ones. Once you are done with data processing, join them back together column-wise with concat in pandas.

    pd.concat([df_numeric, df_categorical], axis=1)
    
  • You will need to save the output of each step in a new dataframe, and pass it further downstream in your data pipeline.

  • To reduce the memory footprint, delete the old dataframe and invoke the garbage collector:

    import gc
    
    del df
    gc.collect() 
    
  • You do not need to save the column index of a numpy array. Simply use df.columns to get the dataframe's column labels. For example, here is how you can convert the output of an sklearn transformation back into a dataframe:

    sim = SimpleImputer()
    sklearn_output_array = sim.fit_transform(df_input)  # returns a numpy array
    
    # Restore the column labels from the input dataframe
    df_output = pd.DataFrame(sklearn_output_array, columns=df_input.columns)
    
    del df_input
    del sklearn_output_array
    gc.collect()
    
    # Name-based feature engineering works again
    df_output["C"] = df_output["A"] / df_output["B"]
    

I agree that the above approach increases the number of lines of code. However, your code will be much more readable and easier to follow.

In addition to the above, here is another Stack Overflow post that deals with one-hot encoding and saving the column names of transformed dataframes for further use downstream. The answer has some code examples you might find useful.

https://stackoverflow.com/a/60107683/12855052

Hope this all helps, and let me know if you have further questions!

Upvotes: 1
