Reputation: 10843
I have been trying to come up with a transformer for the sklearn Pipeline architecture that actually filters and removes records meeting certain criteria in the course of the pipeline - a WHERE clause, if you will. I found this SO answer that says "any transformer that drops or adds samples is, as of the existing versions of scikit-learn, not compliant with the API", but it is from 7 years ago. Have things changed, and if not, what is a good statement of the Pipeline philosophy that explains why removing records does not fit with its concepts? (I definitely get why adding samples doesn't make sense, and I think I get why removing might not, but I'd like to read the official reasons for this in order to better grasp the thinking around the API.)
Upvotes: 3
Views: 1104
Reputation: 5174
This statement is still valid as of today.
I am not aware of any "official" statement in the documentation of scikit-learn that addresses this issue or justifies this design choice. However, I believe the main reason this is not supported is the fact that Pipeline objects only transform X. At least for this, there is a source:
Pipelines only transform the observed data (X).
In consequence, if you were to drop or add any samples in X, the number of samples would become inconsistent with y. I think this is the most likely reason for this design choice.
That being said, the pipeline implementation of imblearn does in fact allow transformations that change the sample size, as it allows resamplers to be included in the pipeline (see here). However, this is restricted to sampling methods that comply with the imblearn API.
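For the filtering use case from the question, imblearn's FunctionSampler can wrap an arbitrary row filter, since a sampler's fit_resample returns both X and y with matching rows. A sketch (the keep_positive function and its filter condition are just illustrative):

```python
import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X = np.array([[-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

def keep_positive(X, y):
    # the "WHERE clause": keep only rows whose first feature is positive,
    # applied to X and y together so they stay aligned
    mask = X[:, 0] > 0
    return X[mask], y[mask]

pipe = Pipeline([
    ("filter", FunctionSampler(func=keep_positive)),
    ("clf", LogisticRegression()),
])

# Works: the sampler drops rows from X and y together.
pipe.fit(X, y)
```

Note that imblearn applies samplers during fit only; at predict time the data passes through unfiltered, which is usually what you want for a training-data filter.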
Upvotes: 4