K-Means calculation on a distributed computation

Question

I am running k-means clustering on Spark 0.9.0, and I am trying to understand how the data is distributed among n machines to compute the k cluster centers.

I understand what k-means clustering is, but I want to know how the data is divided and how the calculation is done in a distributed computation (map and reduce). In this Spark version, KMeansDataGenerator has an option to generate the data points in n partitions. Does each slave node get one partition of the data file?

zsxwing · Accepted Answer

KMeansDataGenerator uses sc.parallelize to generate the data. One of the parameters of sc.parallelize is the number of partitions, and you can set it through KMeansDataGenerator's option.
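To make "number of partitions" concrete, here is a minimal plain-Python sketch of cutting a collection into n contiguous, near-equal slices. It does not call Spark; the function name split_into_partitions and the index arithmetic are assumptions chosen for illustration, not Spark's API.

```python
def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, near-equal slices."""
    n = len(data)
    return [
        # Integer arithmetic on the boundaries keeps slice sizes within one of each other.
        data[(i * n) // num_partitions : ((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


points = list(range(10))            # stand-in for generated data points
partitions = split_into_partitions(points, 3)
for index, part in enumerate(partitions):
    print(index, part)
```

Each slice is the unit of work that a single task would process, so the partition count bounds how many tasks can run in parallel on the data.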

After that, SparkKMeans uses this number of partitions throughout the k-means algorithm.
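As a rough picture of how one k-means iteration can be split into map and reduce steps over partitions, here is a plain-Python sketch. It is not taken from SparkKMeans; the helper names (closest_center, map_partition, merge, kmeans_step) and the sample data are hypothetical, and the partitions are ordinary lists rather than Spark partitions.

```python
from functools import reduce


def closest_center(point, centers):
    """Return the index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def map_partition(partition, centers):
    """Map step: one partition -> {center index: (coordinate sums, point count)}."""
    partial = {}
    for point in partition:
        idx = closest_center(point, centers)
        sums, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([s + p for s, p in zip(sums, point)], count + 1)
    return partial


def merge(a, b):
    """Reduce step: combine the partial sums and counts of two partitions."""
    merged = dict(a)
    for idx, (sums, count) in b.items():
        if idx in merged:
            old_sums, old_count = merged[idx]
            merged[idx] = ([x + y for x, y in zip(old_sums, sums)], old_count + count)
        else:
            merged[idx] = (sums, count)
    return merged


def kmeans_step(partitions, centers):
    """One iteration: map every partition, reduce the partials, recompute centers."""
    totals = reduce(merge, (map_partition(p, centers) for p in partitions), {})
    new_centers = []
    for idx, center in enumerate(centers):
        if idx in totals:
            sums, count = totals[idx]
            new_centers.append([s / count for s in sums])
        else:
            new_centers.append(list(center))  # keep a center that received no points
    return new_centers


# Three hypothetical partitions of 2-D points and two starting centers.
partitions = [
    [[1.0, 1.0], [1.0, 3.0]],
    [[9.0, 9.0], [11.0, 9.0]],
    [[0.0, 2.0], [10.0, 12.0]],
]
centers = [[0.0, 0.0], [10.0, 10.0]]
print(kmeans_step(partitions, centers))
```

The point of the design is that each partition only emits small per-cluster sums and counts, so the raw points never have to be moved between machines; only the current centers and these partial results travel over the network.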

Does each slave node get one partition of data file?

Spark does not guarantee where partitions are located. However, it tries to schedule each computation on the nearest node that holds the partition's data.
