Reputation: 10843
I have been trying to come up with a transformer for the sklearn Pipeline architecture that actually filters and removes records meeting certain criteria in the course of the pipeline - a WHERE clause, if you will. I found this SO answer that says "any transformer that drops or adds samples is, as of the existing versions of scikit-learn, not compliant with the API", but it is from 7 years ago. Have things changed, and if not, what is a good statement of the Pipeline philosophy that explains why removing records does not fit with its concepts? (I definitely get why adding samples doesn't make sense, and I think I get why removing might not, but I'd like to read the official reasons for this in order to better grasp the thinking around the API.)
Upvotes: 3
Views: 1104
Reputation: 5174
This statement is still valid as of today.
I am not aware of any "official" statement in the documentation of scikit-learn that addresses this issue or justifies this design choice. However, I believe the main reason this is not supported is the fact that Pipeline objects only transform X. At least for this, there is a source:
Pipelines only transform the observed data (X).
In consequence, if you were to drop or add any samples in X, the number of samples would become inconsistent with y. I think this is the most likely reason for this design choice.
That being said, the pipeline implementation of imblearn does in fact allow transformations that change the sample size, as it allows resamplers to be included in the pipeline (see here). However, this is restricted to sampling methods that comply with the imblearn API.
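For the filtering use case from the question, imblearn's FunctionSampler can wrap an arbitrary row filter, since a sampler's fit_resample returns both X and y with matching rows. A sketch (the keep_positive function and its filter condition are just illustrative):

```python
import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X = np.array([[-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

def keep_positive(X, y):
    # the "WHERE clause": keep only rows whose first feature is positive,
    # applied to X and y together so they stay aligned
    mask = X[:, 0] > 0
    return X[mask], y[mask]

pipe = Pipeline([
    ("filter", FunctionSampler(func=keep_positive)),
    ("clf", LogisticRegression()),
])

# Works: the sampler drops rows from X and y together.
pipe.fit(X, y)
```

Note that imblearn applies samplers during fit only; at predict time the data passes through unfiltered, which is usually what you want for a training-data filter.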
Upvotes: 4