Reputation: 1019
The following is part of the k-means algorithm, written with Apache Spark:
closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
pointStats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
newPoints = pointStats.map(lambda kv: (kv[0], kv[1][0] / kv[1][1])).collect()
Can anyone explain to me how it works? Assume we have two clusters and 1000 points, and we want to run it on a cluster with two slave nodes and one master node. I think the first function (closest) can be considered the mapper and the second one the combiner, but what about the last function? What is it supposed to do, and which one acts as the reducer?
Upvotes: 2
Views: 2477
Reputation: 18750
You pass reduceByKey a function that serves as both combiner and reducer, because it is required to be an associative aggregate function; if your use case can't use a combiner, you need groupByKey instead. And yes, any time you call map on an RDD in Spark, the function you pass it can be viewed as a mapper. You should definitely take a look at the RDD docs and the PairRDDFunctions docs. Keep in mind that a Spark program will tend to have multiple stages of mapping and reducing, since it attempts to keep intermediate output in memory, whereas standard Hadoop MapReduce reads from and writes to disk every time. Also, if you are using Spark, you can just use the k-means implementation in MLlib.
UPDATE:
In reference to your comment, the reason they "map the (overall sum / num points) to every slave node" is that the way Spark works means this has essentially no overhead. Since Spark builds a DAG for every RDD, nothing is calculated until an action (like collect() in this case) is performed, so the map at the end can seamlessly consume the output of the reducer, which shouldn't spill to disk since it is very small. This is similar to a ChainReducer in Hadoop, except that in Spark every step in the chain of connected RDDs is kept in memory (obviously this is not always possible, so it sometimes spills to disk; this also depends on the serialization level). So the last calculation is in reality done on the same node as the reducer (no shuffle needed afterwards), and only then is the result collected to the driver.
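You can see this laziness and pipelining yourself with a rough sketch (toy data and app name are made up): no job runs until collect(), and the lineage printed by toDebugString() shows the trailing map sitting in the same stage as the output of reduceByKey:

    from pyspark import SparkContext

    sc = SparkContext(appName="lazy-dag-sketch")  # hypothetical app name

    pairs = sc.parallelize([(0, (2.0, 1)), (0, (6.0, 1)), (1, (4.0, 1))])

    # Nothing executes here: these calls only add nodes to the RDD DAG.
    summed = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    averaged = summed.map(lambda kv: (kv[0], kv[1][0] / kv[1][1]))

    # The lineage shows the trailing map pipelined onto the shuffled output of
    # reduceByKey, i.e. it runs on the same nodes with no extra shuffle.
    print(averaged.toDebugString())

    # Only this action triggers a job and pulls the small result to the driver.
    print(averaged.collect())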
Upvotes: 1
Reputation: 3261
If you want to understand the Spark code, the first thing to do is understand k-means. As a high-level overview, k-means does these steps:
(1) pick k initial cluster centers (centroids);
(2) assign every point to its closest centroid;
(3) recompute each centroid as the average of the points assigned to it;
(4) repeat steps (2) and (3) until the centroids stop moving (or some iteration limit is reached).
Your first line is step (2) of the algorithm; your second and third lines are step (3). What those two lines do is find the average of all of the points assigned to each cluster.
The final collect() call moves the data from an RDD to the local machine (in preparation for broadcasting the new centroids back out to all of the nodes in your Spark environment). In order to repeat steps (2) and (3), you have to distribute knowledge of the current centroids to each node.
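Putting it together, here is a hedged sketch of the full iteration loop. The closestPoint helper, the random toy data, the fixed iteration count, and the app name are all simplified stand-ins, not the exact code from the question:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="kmeans-loop-sketch")  # hypothetical app name

    def closestPoint(p, centers):
        """Index of the centroid nearest to p (simplified stand-in)."""
        return int(np.argmin([np.sum((p - c) ** 2) for c in centers]))

    data = sc.parallelize(list(np.random.rand(1000, 2))).cache()  # 1000 toy 2-D points
    kPoints = data.takeSample(False, 2)                           # k = 2 initial centroids

    for _ in range(10):  # fixed iteration count in place of a convergence test
        # Step (2): assign each point to its nearest current centroid.
        closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
        # Step (3): per-cluster (sum, count) on the workers, then the mean...
        pointStats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        newPoints = pointStats.map(lambda kv: (kv[0], kv[1][0] / kv[1][1])).collect()
        # ...and collect() brings the k tiny (index, centroid) pairs to the driver,
        # so the updated kPoints can be shipped back out in the next map's closure.
        for i, c in newPoints:
            kPoints[i] = c

    print(kPoints)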
Upvotes: 1