Reputation: 1337
Recently I've been learning scalable machine learning, and Spark MLlib is the first tool I've picked up. I have already managed to implement some simple machine learning tasks with Spark MLlib, such as linear regression, and they all run smoothly on my laptop.
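Here is a minimal sketch of the kind of program I have running (assuming PySpark and the DataFrame-based `pyspark.ml` API; my real script is similar but longer):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# Run everything on my laptop, one worker thread per CPU core.
spark = SparkSession.builder.master("local[*]").appName("lr-demo").getOrCreate()

# Tiny toy dataset: label = 2 * feature + 1
train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0])),
     (3.0, Vectors.dense([1.0])),
     (5.0, Vectors.dense([2.0]))],
    ["label", "features"])

model = LinearRegression(maxIter=10).fit(train)
print(model.coefficients, model.intercept)

spark.stop()
```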
However, I'm wondering: since the program is not deployed on a cluster and runs on a single node, is it still distributed in this scenario? And if it is, does Spark automatically run tasks with multiple threads?
Can anybody tell me why Spark MLlib makes scalable machine learning easier to implement?
Upvotes: 0
Views: 490
Reputation: 501
Well, it depends on what your definition of "distributed" is.
Spark MLlib is a framework that allows (but does not guarantee) you to write code that is capable of being distributed. It handles a lot of the distribution and synchronisation issues that come with distributed computing. So yes, it makes it much simpler for programmers to write and deploy distributed algorithms.
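To answer the single-node part directly: with a `local[N]` master, Spark runs tasks as N threads inside a single JVM, and `local[*]` uses one thread per core. The programming model is identical either way, so the same code can later be pointed at a real cluster. A minimal sketch (assuming PySpark):

```python
from pyspark.sql import SparkSession

# Only the master URL decides where the work runs; the rest of the
# application code is unchanged between laptop and cluster.
spark = (SparkSession.builder
         .appName("where-am-i")
         .master("local[*]")  # single JVM, one task thread per core; on a
                              # cluster you'd submit with e.g. --master yarn
                              # and drop this line
         .getOrCreate())

# Number of task slots Spark will schedule onto by default.
print(spark.sparkContext.defaultParallelism)

spark.stop()
```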
Spark makes scalable ML easier because you can focus on the algorithm itself, rather than getting bogged down by data races or by how to distribute code across nodes while accounting for data locality, etc. All of that is typically handled by the SparkContext / RDD machinery.
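For example, a plain map/reduce over an RDD is automatically split into tasks and scheduled wherever executors (or local threads) are available; there are no locks or message-passing in user code. A sketch, again assuming PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# parallelize() splits the data into partitions; each partition becomes
# a task that Spark schedules for us -- no explicit threading code.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()
```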
That being said, coding for Spark does not guarantee that your program will be distributed optimally. There are still things to consider, such as data partitioning and the level of parallelism, among many others.
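For instance (a hypothetical tuning sketch, assuming PySpark), partition counts are something you often have to set yourself:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("partitions").getOrCreate()

df = spark.range(10_000_000)
print(df.rdd.getNumPartitions())  # whatever Spark chose by default

# Too few partitions -> idle cores; too many -> scheduling overhead.
df = df.repartition(64)

# Shuffle stages (joins, groupBy) get their own parallelism setting.
spark.conf.set("spark.sql.shuffle.partitions", "64")

spark.stop()
```

Picking the right numbers for a given dataset and cluster is still on you.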
Upvotes: 1