Rkz

Reputation: 1257

Spark k-means OutOfMemoryError

I am using Spark's k-means clustering from the ML module and I am programming in PySpark. The module works well up to 200 clusters, but it gives an OutOfMemoryError once I go past 300 clusters. My data contains 200k objects with 25k features for each object. I am following the guidelines given for the class pyspark.ml.clustering.KMeans in the PySpark ML documentation. The only difference between the code in that documentation and mine is that I am using sparse vectors instead of dense ones.
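A minimal sketch of that setup (toy sparse rows stand in for the real 200k x 25k data, and k is kept small so the snippet actually runs; the failing jobs use k in the hundreds):

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.appName("kmeans-sparse").getOrCreate()

    # Toy stand-in: a few sparse rows with 25,000 features each.
    # The real input is roughly 200k such rows.
    df = spark.createDataFrame(
        [(Vectors.sparse(25000, [0, 3, 17], [1.0, 2.0, 0.5]),),
         (Vectors.sparse(25000, [5, 42], [3.0, 1.0]),),
         (Vectors.sparse(25000, [7, 99], [4.0, 2.0]),)],
        ["features"])

    # k=2 here only so the toy example runs; the real runs use k >= 300.
    kmeans = KMeans(k=2, seed=1, featuresCol="features")
    model = kmeans.fit(df)   # fit() is the call (o98.fit) that fails once k grows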

There is no hardware limitation, since I have a reasonably large cluster setup with over 700 cores and 2 TB of memory. I searched for this problem, and most of the links suggest one or all of a few standard configuration changes. The following are the things that I tried:

In addition to the above configuration, I set the executor memory to 16g and the cores to 150. Unfortunately, nothing has worked, and I keep getting the following error (truncated):

Py4JJavaError: An error occurred while calling o98.fit.
: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.mllib.linalg.SparseVector.toArray(Vectors.scala:678)
    at org.apache.spark.mllib.clustering.VectorWithNorm.toDense(KMeans.scala:612)

Does this mean Spark cannot handle even a 200k x 25k dataset with 300+ clusters, or am I missing something?
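For reference, a configuration along these lines (an assumed equivalent of the memory and core settings described above; the exact submission flags are not shown) would look roughly like:

    from pyspark.sql import SparkSession

    # Assumed equivalents of "executor memory 16g, cores 150".
    # spark.cores.max caps the total cores for the application (standalone/Mesos);
    # spark.executor.cores would instead set the cores per executor.
    spark = (SparkSession.builder
             .appName("kmeans-sparse")
             .config("spark.executor.memory", "16g")
             .config("spark.cores.max", "150")
             .getOrCreate())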

Upvotes: 1

Views: 873

Answers (1)

Aman

Reputation: 8995

org.apache.spark.mllib.clustering.VectorWithNorm.toDense(KMeans.scala:612)

That's the problem. The cluster centres are converted to a dense representation and then broadcast to all the executors. This will not scale with thousands of features, which is your case. Check out SparseML.
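A back-of-the-envelope calculation (assuming 8-byte doubles and ignoring JVM object overhead) shows why densification hurts at this width:

    n_features = 25000

    one_dense_vector_kb = n_features * 8 / 1e3          # ~200 KB per densified vector, however sparse it was
    dense_centres_mb = 300 * n_features * 8 / 1e6        # ~60 MB per copy of 300 dense centres
    dense_dataset_gb = 200000 * n_features * 8 / 1e9     # ~40 GB if the data points themselves were densified

Every vector that goes through SparseVector.toArray costs about 200 KB of heap regardless of its sparsity, so the dense copies created during training and centre broadcasting add up quickly once both k and the feature count are large.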

Upvotes: 1
