Maju116

Reputation: 1607

SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?

I've got an RDD of LabeledPoint on which I want to run a decision tree (and later a random forest)

scala> transformedData.collect
res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064,200070,200072,200075,200087,400008,400011],[4.0,1.0,6.0,53.0,6.0,1.0,1.0,2.0,11.0,17.0,48.0,3.0,4.0,113.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,1.0,1.0,1.0,1.0,1.0,4.0])), (0.0,(400036,[1589,3585,4830,6935,6936,7744,400008,400011],[2.0,6.0,3.0,52.0,4.0,3.0,1.0,2.0])), (0.0,(400036,[1589,2162,2784,2922,4123,7008,7131,7792,8608],[23.0,70.0,1.0,2.0,2.0,1.0,1.0,2.0,2.0])), (0.0,(400036,[4830,6935,6936,400008,400011],[1.0,36.0,...

using code:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.impurity.Gini

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]() //change to what?
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(
  trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

In my data I've got two types of features:

  1. some features are counts from user visits on a given website/domain (feature is a website/domain and its value is number of visits)

  2. rest of the features are some declarative variables - binary/categorical

Is there a way to create categoricalFeaturesInfo automatically from the LabeledPoint data? I want to determine the levels of my declarative variables (type 2) and then use that information to create categoricalFeaturesInfo.

I have a list with the declarative variables:

List(6363,21345,23455,...

Upvotes: 2

Views: 2065

Answers (1)

zero323

Reputation: 330183

categoricalFeaturesInfo should map from a feature index to the number of categories for that feature. Generally speaking, identifying categorical variables automatically can be expensive, especially if they are heavily mixed with continuous variables. Moreover, depending on your data, it can give both false positives and false negatives. Keeping that in mind, it is better to set these values manually.
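For example, if the declarative features from your list turned out to be one binary feature and two small categorical ones, the map could simply be written by hand. A minimal sketch, where the level counts (2, 4, 3) are hypothetical placeholders and MLlib expects each such feature to take values 0.0, 1.0, ..., k - 1:

val categoricalFeaturesInfo = Map[Int, Int](
  6363  -> 2,   // binary declarative feature -> 2 levels (placeholder)
  21345 -> 4,   // categorical feature with 4 levels (placeholder)
  23455 -> 3    // categorical feature with 3 levels (placeholder)
)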

If you still want to create categoricalFeaturesInfo automatically you can take a look at ml.feature.VectorIndexer. It is not directly applicable in this case but should provide a useful code base to build your own solution.
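As a rough sketch (not VectorIndexer itself), you could take the list of declarative indices you already have and count the distinct values observed at each of them, assuming those values are already encoded as 0.0, 1.0, ..., k - 1 as MLlib requires (if there are gaps, use max value + 1 instead of the distinct count):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// For every declarative (categorical) index, count the distinct values
// seen in the data and use that count as the number of levels.
def buildCategoricalFeaturesInfo(
    data: RDD[LabeledPoint],
    categoricalIndices: Seq[Int]): Map[Int, Int] = {
  data
    .flatMap(lp => categoricalIndices.map(i => (i, lp.features(i))))
    .distinct()
    .countByKey()                      // index -> number of distinct values
    .map { case (i, n) => (i, n.toInt) }
    .toMap
}

val declarative = List(6363, 21345, 23455)   // the list from the question
val categoricalFeaturesInfo =
  buildCategoricalFeaturesInfo(transformedData, declarative)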

Upvotes: 2
