willing_astronomer

Reputation: 131

User defined impurity in Regression Decision Trees

I am migrating from R to PySpark. I have a process that builds a regression tree, currently implemented with R's rpart package.

While configuring this in PySpark, I can't find an option to specify a custom impurity function. I have a skewed dataset, and instead of using the mean and variance/standard deviation as the criterion for a node's impurity, I want to use a metric better suited to my skewed data. How can I define a custom impurity function in PySpark?

I've looked at the documentation for Decision Tree Regression, and the documentation for the impurity parameter only mentions support for variance:

impurity = Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance')
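This can also be confirmed at runtime; explainParam is a standard method on PySpark ML estimators:

from pyspark.ml.regression import DecisionTreeRegressor

# Print the documentation string for the impurity param,
# including its supported options and default value.
print(DecisionTreeRegressor().explainParam("impurity"))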

Is there any workaround to define a custom impurity function?

Upvotes: 1

Views: 85

Answers (1)

Sachin Hosmani

Reputation: 1762

This doesn't seem to be possible. I looked into this a few years ago, and nothing appears to have changed since then.

In my case I used a workaround: transform the label to reduce the skew (e.g., apply a log transform), fit the model on the transformed label, and invert the transform at inference time to recover predictions on the original scale.
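Here is a minimal sketch of that workaround. It assumes a training DataFrame train_df and a scoring DataFrame test_df (placeholder names), each with a features vector column and a skewed label column:

from pyspark.sql import functions as F
from pyspark.ml.regression import DecisionTreeRegressor

# Fit on the log-transformed label; log1p keeps zero labels valid.
train_log = train_df.withColumn("log_label", F.log1p(F.col("label")))
dt = DecisionTreeRegressor(featuresCol="features", labelCol="log_label")
model = dt.fit(train_log)

# Invert the transform at inference time so predictions are
# back on the original scale of the label.
preds = (model.transform(test_df)
         .withColumn("prediction_original", F.expm1(F.col("prediction"))))

One caveat: the back-transformed prediction behaves like a geometric mean (closer to the median than the mean of the original label), which for skewed targets is often acceptable or even desirable.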

Another option would be to write your own regression decision tree class that directly uses lower-level Spark APIs and plugs in a custom impurity function; a rough sketch of the split-scoring piece follows.
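PySpark doesn't expose its internal tree-building APIs, so purely as an illustration, here is what the custom-impurity and split-scoring pieces might look like using plain DataFrame aggregations. The impurity here (mean absolute deviation from the median) is a hypothetical skew-robust choice, and all names and columns are placeholders; a real implementation would also enumerate features and thresholds, recurse to build the tree, and pay attention to performance:

from pyspark.sql import functions as F

def mad_impurity(df, label_col="label"):
    # Mean absolute deviation from the median: a skew-robust
    # alternative to variance as a node impurity.
    median = df.approxQuantile(label_col, [0.5], 0.001)[0]
    return df.select(F.avg(F.abs(F.col(label_col) - F.lit(median)))).first()[0]

def split_score(df, feature_col, threshold, label_col="label"):
    # Weighted impurity of the two children produced by the split
    # feature_col <= threshold; lower scores mean better splits.
    left = df.filter(F.col(feature_col) <= threshold)
    right = df.filter(F.col(feature_col) > threshold)
    n, nl, nr = df.count(), left.count(), right.count()
    if nl == 0 or nr == 0:
        return float("inf")
    return (nl / n) * mad_impurity(left, label_col) + (nr / n) * mad_impurity(right, label_col)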

Upvotes: 0
