Reputation: 1366
I am running k-means clustering on Spark 0.9.0 and I am trying to understand how the data is distributed among n systems to calculate the k center data points.
I understand what k-means clustering is, but I want to know how the data is divided and how the calculation is done in a distributed computation (map and reduce). In this Spark version, KMeansDataGenerator has an option to generate the data points in n partitions. Does each slave node get one partition of the data file?
Upvotes: 1
Views: 416
Reputation: 20816
KMeansDataGenerator uses sc.parallelize to generate the data. sc.parallelize takes a parameter that sets the number of partitions, and you can change it via KMeansDataGenerator's option.
After that, SparkKMeans uses this number of partitions throughout the whole k-means algorithm.
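To make the map and reduce steps more concrete, here is a minimal sketch in plain Python (no Spark involved) of how one k-means iteration could be computed over partitioned data. It is not Spark's actual code: the helper names (split_into_partitions, map_partition, reduce_partials, kmeans_step) are hypothetical, and the contiguous near-equal slicing is an assumption made for illustration. Each partition produces partial sums and counts per center independently, and the partial results are then merged to get the new centers.

```python
def split_into_partitions(points, num_partitions):
    """Split a list into contiguous, near-equal slices (illustrative assumption)."""
    n = len(points)
    return [points[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]


def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))


def map_partition(partition, centers):
    """Map side: needs only this partition's points plus the current centers."""
    partial = {}
    for point in partition:
        idx = closest_center(point, centers)
        total, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([t + p for t, p in zip(total, point)], count + 1)
    return partial


def reduce_partials(partials):
    """Reduce side: merge per-partition (sum, count) pairs for each center."""
    merged = {}
    for partial in partials:
        for idx, (total, count) in partial.items():
            if idx in merged:
                m_total, m_count = merged[idx]
                merged[idx] = ([a + b for a, b in zip(m_total, total)],
                               m_count + count)
            else:
                merged[idx] = (total, count)
    return merged


def kmeans_step(partitions, centers):
    """One iteration: map over every partition, reduce, then average."""
    merged = reduce_partials(map_partition(p, centers) for p in partitions)
    new_centers = list(centers)  # a center with no assigned points is kept as-is
    for idx, (total, count) in merged.items():
        new_centers[idx] = tuple(t / count for t in total)
    return new_centers


# Usage: six 2-D points split into three partitions, two starting centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0),
          (10.0, 12.0), (2.0, 0.0), (12.0, 10.0)]
partitions = split_into_partitions(points, 3)
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = kmeans_step(partitions, centers)
```

The point of the sketch is that the map step never needs data from another partition, so the number of partitions only changes how the work is split, not the resulting centers (up to floating-point rounding).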
Does each slave node get one partition of data file?
Spark does not guarantee the location of partitions, so a node does not necessarily get exactly one partition. However, it will try to schedule each computation on the node that holds the partition's data, or on the nearest node to it.
Upvotes: 5