boot-scootin

Reputation: 12515

scikit-learn: ColumnTransformer and OneHotEncoder – how to err out for all new categorical levels across all fields?

I'm attempting to use scikit's ColumnTransformer class as both an actual DataFrame transformer and as a "monitoring" transformer – i.e., an object to monitor when new classes come into categorical features in my dataset.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Original DataFrame off of which transformers are fit
orig_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'a'],
        'b': ([np.nan] * 3) + ['a', 'a'],
        'c': np.random.randn(5)
    }
)

# New DataFrame that will be transformed using already fitted transformer
new_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'c'],
        'b': ([np.nan] * 4) + ['b'],
        'c': np.random.randn(5)
    }
)

# Cast NaNs to str to play nicely with OneHotEncoder
for col in ('a', 'b'):
    orig_df[col] = orig_df[col].astype(str)
    new_df[col] = new_df[col].astype(str)

# Create master transformer for each of the three columns a, b, and c
# (scikit-learn 1.2+ renames `sparse` to `sparse_output`; `sparse` was removed in 1.4)
transformer_config = [
    ('a', OneHotEncoder(sparse=False, handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(sparse=False, handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
]

transformer = ColumnTransformer(transformer_config)

# Fit to original dataset
transformer.fit(orig_df)

# Transform new dataset
transformer.transform(new_df)

Which produces:

  File "<stdin>", line 2, in <module>
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 495, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 393, in _fit_transform
    fitted=fitted, replace_strings=True))
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/pipeline.py", line 605, in _transform_one
    res = transformer.transform(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 591, in transform
    return self._transform_new(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 553, in _transform_new
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 109, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['c'] in column 0 during transform

This produces the error I generally want, but only for one column. As you can see in new_df, column b has a new level, too, ('b'). Is there a straightforward way of reporting back all new levels for all fields that use this OneHotEncoder class, instead of just the first one that errs out?

My first thought was to try iterating through each field individually, try-catching each ValueError, but that doesn't play nicely with ColumnTransformer:

>>> transformer.transform(new_df[['b']])
KeyError: "None of [['a']] are in the [columns]"

Upvotes: 1

Views: 1773

Answers (1)

Jan K

Reputation: 4150

Just a suggested solution for your example:

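from sklearn.base import BaseEstimator

# Apply each fitted transformer to its own columns, so that one failure
# does not hide the failures in the remaining columns
for _, t_inst, t_col in transformer.transformers_:
    # Skip string placeholders such as 'passthrough' and 'drop'
    if not isinstance(t_inst, BaseEstimator):
        continue
    try:
        t_inst.transform(new_df[t_col])
    except ValueError as e:
        # OneHotEncoder raises ValueError when it meets an unknown category
        print('During transformation of column {} the following error occurred: {}'.format(t_col, e))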

Output

During transformation of column ['a'] the following error occurred: Found unknown categories ['c'] in column 0 during transform
During transformation of column ['b'] the following error occurred: Found unknown categories ['b'] in column 0 during transform

It simply tries to apply the transformations one by one.

Note that the .transformers_ attribute is only available after fitting.
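If you would rather get the new levels back as data instead of parsing error messages, one possible variation is to compare each column against the fitted encoder's categories_ attribute. This is only a sketch: the helper name find_unseen_levels is made up for illustration, it assumes every encoder was configured with a list of column names (as in the question), and the encoders are built without the sparse argument so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def find_unseen_levels(fitted_ct, df):
    """Map each one-hot-encoded column to the levels in df not seen during fit.

    Hypothetical helper, not part of scikit-learn.
    """
    unseen = {}
    for _, t_inst, t_cols in fitted_ct.transformers_:
        # Only fitted OneHotEncoder instances expose categories_
        if not isinstance(t_inst, OneHotEncoder):
            continue
        # categories_ holds one array of known levels per encoded column
        for col, known in zip(t_cols, t_inst.categories_):
            new_levels = set(df[col].unique()) - set(known)
            if new_levels:
                unseen[col] = sorted(new_levels)
    return unseen


# Same data as in the question, with fixed values for the numeric column
orig_df = pd.DataFrame({
    'a': [np.nan, 'a', 'b', 'b', 'a'],
    'b': ([np.nan] * 3) + ['a', 'a'],
    'c': [0.1, 0.2, 0.3, 0.4, 0.5],
})
new_df = pd.DataFrame({
    'a': [np.nan, 'a', 'b', 'b', 'c'],
    'b': ([np.nan] * 4) + ['b'],
    'c': [0.5, 0.4, 0.3, 0.2, 0.1],
})

# Cast NaNs to str, as in the question
for col in ('a', 'b'):
    orig_df[col] = orig_df[col].astype(str)
    new_df[col] = new_df[col].astype(str)

transformer = ColumnTransformer([
    ('a', OneHotEncoder(handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
])
transformer.fit(orig_df)

unseen = find_unseen_levels(transformer, new_df)
print(unseen)
```

An empty dict means no new levels were found, so you can raise a single error listing everything in unseen before calling transformer.transform(new_df).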

Upvotes: 1
