don007

Reputation: 53

Why is the constructor of a sklearn transformer inside a ColumnTransformer invoked twice, and with different parameters each time?

Three questions about the code below and its output:

  1. Why is the constructor of the MyDebug transformer invoked twice, first at line 26 and again at line 37?
  2. Why do the two invocations show different values for the parameter myname? The second invocation (line 37) is especially odd: it receives neither the value I passed nor the default, but None, as the output shows.
  3. If you uncomment line 36 (ct1.fit), it also invokes the transformer's transform method, which I would only expect from ct1.fit_transform. Why?

Environment: Python version is 3.6.10 and Sklearn version is 0.22.1

  1 import numpy as np
  2 from sklearn.compose import ColumnTransformer
  3 from sklearn.preprocessing import Normalizer
  4 from sklearn.base import BaseEstimator,TransformerMixin
  5 from sklearn.pipeline import Pipeline
  6 from datetime import datetime
  7
  8
  9
 10 class MyDebug(BaseEstimator, TransformerMixin):
 11     def __init__(self, myname="HELP"):
 12         print(f"intialized with myname: {myname}")
 13         self._name = myname
 14         print (f"Debug.__init__ being invoked for {myname}, {self._name}, {id(self)}")
 15     def transform(self, X):
 16         print (f"in {self._name} transform with type: {type(X)}, shape: {X.shape} at {datetime.now()}")
 17         self.shape = X.shape
 18         # what other output you want
 19         return X
 20     def fit(self, X, y=None, **fit_params):
 21         print (f"in {self._name} fit with type: {type(X)}, shape: {X.shape} at {datetime.now()}")
 22         return self
 23
 24
 25 print("************************************************************")
 26 ct1 = ColumnTransformer(
 27     [("norm1", Pipeline(steps=[("norm", Normalizer(norm='l1')), ("debug", MyDebug("MYDEBUG_1"))]), [0, 1]),
 28      ("norm2", Pipeline(steps=[("norm", Normalizer(norm='l1')), ("debug", MyDebug("MYDEBUG_2"))]), slice(2, 10))])
 29
 30 print("************************************************************")
 31 print(f"id(ct1): {id(ct1)}")
 32 X = np.array([[0., 1., 2., 2., 0., 1., 2., 2.],
 33               [1., 1., 0., 1., 1., 1., 0., 1.]])
 34
 35 print("************************************************************")
 36 # ret = ct1.fit(X)
 37 ret = ct1.fit_transform(X)
 38 print("************************************************************")
 39 print(f"id(ct1): {id(ct1)}")
 40 print(f"type(ret): {type(ret)}")
 41 print(type(ct1.named_transformers_["norm1"]), id(ct1.named_transformers_["norm1"]), id(ct1.named_transformers_["norm2"]), "\n",
 42 type(ct1.named_transformers_["norm1"].named_steps["norm"]), id(ct1.named_transformers_["norm1"].named_steps["norm"]), id(ct1.named_transformers_["norm2"].named_steps["norm"]), "\n",
 43 type(ct1.named_transformers_["norm1"].named_steps["debug"]), id(ct1.named_transformers_["norm1"].named_steps["debug"]), id(ct1.named_transformers_["norm2"].named_steps["debug"]))

Output:

************************************************************
intialized with myname: MYDEBUG_1
Debug.__init__ being invoked for **MYDEBUG_1, MYDEBUG_1**, 140118618819160
intialized with myname: MYDEBUG_2
Debug.__init__ being invoked for **MYDEBUG_2, MYDEBUG_2**, 140118618819216
************************************************************
id(ct1): 140118618819328
************************************************************
intialized with myname: None
Debug.__init__ being invoked for **None, None**, 140118618819944
in None fit with type: <class 'numpy.ndarray'>, shape: (2, 2) at 2021-03-24 00:45:41.850603
in None transform with type: <class 'numpy.ndarray'>, shape: (2, 2) at 2021-03-24 00:45:41.851159
intialized with myname: None
Debug.__init__ being invoked for **None, None**, 140118618820392
in None fit with type: <class 'numpy.ndarray'>, shape: (2, 6) at 2021-03-24 00:45:41.852955
in None transform with type: <class 'numpy.ndarray'>, shape: (2, 6) at 2021-03-24 00:45:41.852995
************************************************************
id(ct1): 140118618819328
type(ret): <class 'numpy.ndarray'>
<class 'sklearn.pipeline.Pipeline'> 140118618819776 140118618820000 
 <class 'sklearn.preprocessing._data.Normalizer'> 140118618819888 140118618820112 
 <class '__main__.MyDebug'> 140118618819944 140118618820392

Upvotes: 1

Views: 147

Answers (1)

Ben Reiniger

Reputation: 12602

  1. Why is the constructor of the MyDebug transformer invoked twice, first at line 26 and again at line 37?

The instantiation at line 26 is expected, because that is where you call the class. It happens again at line 37 because ColumnTransformer clones its transformers before fitting them, and cloning constructs a new instance.
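To see the cloning step in isolation, here is a minimal sketch using sklearn.base.clone directly on a built-in transformer. The variable names are mine, not from your code; the point is only that clone returns a fresh, unfitted instance built from the original's parameters.

```python
from sklearn.base import clone
from sklearn.preprocessing import Normalizer

original = Normalizer(norm="l1")

# clone() reads the parameters via get_params() and calls the class again,
# so __init__ runs a second time on a brand-new object.
cloned = clone(original)

print(cloned is original)                           # False
print(cloned.get_params() == original.get_params())  # True
```

This is also why the ids you print after fitting (ct1.named_transformers_) differ from the ids printed by the constructors at line 26: the fitted objects are the clones, not the instances you passed in.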

  2. Why do the two invocations show different values for the parameter myname? The second invocation (line 37) is especially odd: it receives neither the value I passed nor the default, but None, as the output shows.

This is in part because your custom estimator doesn't adhere to the sklearn API: every __init__ parameter must be stored on an attribute with exactly the same name (here self.myname, not self._name). The reason is how sklearn clones estimators: the parameters are read back using get_params, which looks up an attribute named after each __init__ parameter, and then a new instance is created from those values. The 0.21 behavior of get_params, when a parameter has no matching attribute, is to return None, so the clone is constructed with myname=None; in 0.24 that changes to raising an AttributeError.
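A sketch of a version that survives cloning, assuming you only need the name preserved. MyDebugFixed is a hypothetical name for illustration; the only essential change from your class is that the parameter is stored as self.myname.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


# Mixins are listed before BaseEstimator, the order sklearn's
# developer guide recommends; your original order also clones fine.
class MyDebugFixed(TransformerMixin, BaseEstimator):
    def __init__(self, myname="HELP"):
        # Attribute name matches the parameter name, so get_params() finds it.
        self.myname = myname

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


est = MyDebugFixed("MYDEBUG_1")
cloned = clone(est)

print(est.get_params())  # {'myname': 'MYDEBUG_1'}
print(cloned.myname)     # MYDEBUG_1
```

It is also best to keep __init__ free of logic other than the assignments; anything derived from the parameters can be computed in fit.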

  3. If you uncomment line 36 (ct1.fit), it also invokes the transformer's transform method, which I would only expect from ct1.fit_transform. Why?

In order to properly decide whether its output will be sparse or dense, ColumnTransformer.fit actually calls ColumnTransformer.fit_transform (rather than the usual reverse) (source).
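You can check this on your own installation with a small counting transformer. This is a sketch: CountingTransformer and its n_transform_calls_ attribute are made up for the demonstration, and the exact number of calls is an implementation detail that could differ between sklearn versions.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class CountingTransformer(TransformerMixin, BaseEstimator):
    """Pass-through transformer that counts calls to transform()."""

    def fit(self, X, y=None):
        self.n_transform_calls_ = 0
        return self

    def transform(self, X):
        self.n_transform_calls_ += 1
        return X


X = np.array([[0.0, 1.0, 2.0],
              [1.0, 1.0, 0.0]])

ct = ColumnTransformer([("count", CountingTransformer(), [0, 1])])
ct.fit(X)  # only fit is requested here, not fit_transform

# The fitted clone lives in named_transformers_, not in the object we passed in.
fitted = ct.named_transformers_["count"]
print(fitted.n_transform_calls_)  # non-zero if fit went through fit_transform
```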

Upvotes: 1
