tesnik03

Reputation: 1359

spark data partitioning for decision tree

I am reading the Spark MLlib documentation, and in the decision tree documentation it says -

 Each partition is chosen greedily by selecting the best split from a set 
of possible splits, in order to maximize the information gain at a tree node.

Here is the link.

I am not able to understand -

  1. Is the "partition" being talked about a Spark data partition or a feature partition?
  2. Or could it be splits within each data partition?

Upvotes: 2

Views: 504

Answers (1)

sourabh

Reputation: 476

The reference to "partition" here has nothing to do with Spark data partitions. It refers to the partitioning of the data at a tree node based on a selected feature, i.e. the "data partitioning" described by the algorithm itself.

If you check the actual implementation, it queues all the nodes that need to be split and selects a batch of them based on the available memory (configurable). The idea is that the number of passes over the data can be reduced if the statistics for a batch of nodes and their features are computed in one pass. Then, for each node, it takes a subset of the features (configurable) and calculates the statistics for each feature, which yields a set of possible splits. Only the best possible split is sent to the driver node (node here means the Spark driver machine; the terminology can be confusing :)), which augments the tree. Each datum, or row in your RDD, is represented as a BaggedTreePoint and stores the information about which tree node it currently belongs to.

It will take a bit of time to go through the source code, but it may be worth it.
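To make the "set of possible splits" idea concrete, here is a minimal single-machine sketch of greedy split selection by information gain. This is not MLlib's actual code; the function names (`entropy`, `information_gain`, `best_split`) and the midpoint-threshold strategy are my own illustrative choices (MLlib actually bins continuous features into candidate splits), but the greedy "try every candidate split, keep the one with maximum gain" loop is the same idea the docs describe.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Gain from partitioning `labels` into `left` and `right` subsets."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_split(rows, labels):
    """Greedily pick the (feature, threshold) pair maximizing information gain.

    rows: list of feature vectors (lists of floats), labels: class labels.
    Returns (feature_index, threshold, gain).
    """
    best = (None, None, -1.0)
    for f in range(len(rows[0])):
        # Candidate thresholds: midpoints between sorted distinct values.
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = information_gain(labels, left, right)
            if gain > best[2]:
                best = (f, t, gain)
    return best

# A perfectly separable toy set: the best split lands between 2.0 and 10.0.
rows = [[1.0], [2.0], [10.0], [11.0]]
labels = ["a", "a", "b", "b"]
print(best_split(rows, labels))  # -> (0, 6.0, 1.0)
```

MLlib does the same evaluation, but for a batch of tree nodes at once and with the per-split statistics aggregated across the cluster's data partitions in a single pass.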

Upvotes: 4

Related Questions