Priya Gupta

Reputation: 59

Spark MLlib parallelism across multiple nodes

Can the machine learning algorithms provided by Spark MLlib, such as naive Bayes and random forest, run in parallel across a Spark cluster, or do we need to change our code? Could you provide an example that runs in parallel? I am not sure how parallelism (map) works in MLlib, since each algorithm seems to require the entire training data set. Does the computation run in parallel on subsets of the training data? Thanks

Upvotes: 3

Views: 2245

Answers (1)

Katya Willard

Reputation: 2182

These algorithms, as provided by Spark MLlib, run in parallel automatically. They expect an RDD as input. An RDD is a resilient distributed dataset, partitioned across the machines of a cluster.

Here is an example that uses a decision tree for classification.

I highly recommend exploring the link provided above in depth. The page has extensive documentation and examples of how to code these algorithms, including generating training and test datasets, scoring, cross-validation, etc.

These algorithms run in parallel because each worker node runs computations on its own subset of the data, and the results of those computations are then shared across worker nodes and with the master (driver) node. The master node collects the partial results and aggregates them as necessary to make decisions based on the entire dataset. Computation-heavy work is mostly executed on the worker nodes.
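To make the "partial results, then aggregate" idea concrete, here is a minimal sketch in plain Python. It is not Spark code and not MLlib's actual implementation. It only simulates the pattern with the counting step of a naive Bayes style model, where counts from separate partitions can simply be added together. The helper names (`partition`, `local_counts`, `merge`, `train`) and the tiny dataset are made up for illustration.

```python
from collections import Counter
from functools import reduce

def partition(rows, num_partitions):
    """Split rows round-robin into chunks (stand-ins for the partitions of an RDD)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def local_counts(rows):
    """Work done by one simulated worker: count labels and (label, word) pairs
    using only the rows in its own partition."""
    label_counts = Counter()
    word_counts = Counter()
    for label, words in rows:
        label_counts[label] += 1
        for word in words:
            word_counts[(label, word)] += 1
    return label_counts, word_counts

def merge(a, b):
    """Combine two partial results. Counts are additive, so merging is just a sum."""
    return a[0] + b[0], a[1] + b[1]

def train(rows, num_partitions=3):
    parts = partition(rows, num_partitions)
    # On a real cluster this step would run on the workers, one partition each.
    partials = [local_counts(p) for p in parts]
    # This step plays the role of the aggregation on the master (driver) node.
    return reduce(merge, partials)

# Hypothetical toy data: (label, list of words)
rows = [
    ("spam", ["buy", "now"]),
    ("ham", ["hello", "friend"]),
    ("spam", ["buy", "cheap"]),
    ("ham", ["hello", "again"]),
    ("spam", ["now"]),
]

distributed = train(rows, num_partitions=3)
single_machine = local_counts(rows)
print(distributed == single_machine)
```

No simulated worker ever sees the whole training set, yet the merged counts are identical to what a single pass over all the data would produce. That is why an algorithm that appears to need the entire dataset can still be split across nodes, as long as the statistics it needs can be combined from per-partition pieces.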

Upvotes: 1
