Reputation: 131
I am migrating from R to PySpark. I have a process that builds a regression tree, currently implemented with R's rpart algorithm.
While configuring the equivalent in PySpark, I cannot find an option to specify a custom impurity function. My dataset is skewed, so instead of using mean and variance/standard deviation as the node impurity criterion, I want to use a metric better suited to skewed data. How can I define a custom impurity function in PySpark?
I've looked at the documentation for Decision Tree Regression, and the documentation for the impurity parameter only mentions support for variance:
impurity = Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance')
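For reference, a minimal sketch of how to pull up that doc string from a PySpark session (explainParam is a standard method on pyspark.ml params; the output wording may vary slightly by Spark version):

from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor()
# Prints the documentation for the impurity param, which lists
# "variance" as the only supported option for regression trees.
print(dt.explainParam("impurity"))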
Is there any workaround to define a custom impurity function?
Upvotes: 1
Views: 85
Reputation: 1762
This doesn't seem to be possible. I looked for this a few years ago, and nothing appears to have changed since then.
In my case, the workaround was to transform the label to reduce the skew (e.g. by applying a log transform), fit the model on the transformed label, and then invert the transform on the predictions at inference time to recover values on the original scale. A sketch of this is shown below.
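A minimal sketch of that workaround, assuming a DataFrame df with a "features" vector column and a skewed numeric "label" column (the column names and maxDepth value are illustrative, not from the question):

from pyspark.sql import functions as F
from pyspark.ml.regression import DecisionTreeRegressor

# Train on log1p(label) so the variance criterion is less dominated
# by the long tail of the skewed target.
train_df = df.withColumn("log_label", F.log1p(F.col("label")))

dt = DecisionTreeRegressor(
    featuresCol="features",
    labelCol="log_label",
    maxDepth=5,  # depth control, analogous to rpart's complexity settings
)
model = dt.fit(train_df)

# At inference time, invert the transform: expm1 undoes log1p,
# giving predictions back on the original scale.
pred_df = (
    model.transform(train_df)
         .withColumn("prediction_original_scale", F.expm1(F.col("prediction")))
)

This keeps the standard variance-based splits, but applying them on the transformed label often behaves much better on skewed targets.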
Another option would be to write your own class for a custom regression decision tree that directly uses the lower-level Spark APIs and plugs in a custom impurity function.
Upvotes: 0