Reputation: 1257
I am using Spark's k-means clustering from the ML module, and I am programming in PySpark. It works well up to 200 clusters, but it throws an OutOfMemoryError once I go past 300 clusters. My data contains 200k objects with 25k features per object. I am following the guidelines given for the class pyspark.ml.clustering.KMeans in the pyspark ML documentation. The only difference between the code in that documentation and mine is that I am using sparse vectors instead of dense ones.
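For reference, here is a minimal sketch of the setup described above (the session builder, column name, and toy rows are illustrative, not my actual code; the real DataFrame has 200k rows of 25k-dimensional sparse vectors):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("sparse-kmeans-sketch").getOrCreate()

# Each row carries a 25k-dimensional SparseVector; only a few indices are
# populated here to keep the sketch small.
data = [(Vectors.sparse(25000, [0, 17, 9423], [1.0, 3.0, 2.0]),),
        (Vectors.sparse(25000, [5, 17, 12001], [2.0, 1.0, 4.0]),)]
df = spark.createDataFrame(data, ["features"])

kmeans = KMeans(k=300, seed=1, featuresCol="features")
model = kmeans.fit(df)  # the fit() call is where the OutOfMemoryError is raised for k >= 300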
There is no hardware limitation, since I have a reasonably large cluster with over 700 cores and 2 TB of memory. I searched for this problem and most of the links led me to one or all of the following configuration changes. This is the set of things that I tried:
conf.set("spark.driver.memory", "64g")
conf.set("spark.default.parallelism","1000")
conf.set("spark.storage.memoryFraction", "1")
In addition to the above configuration, I set the executor memory to 16g and the number of cores to 150. Unfortunately nothing has worked and I keep getting the following error (truncated):
Py4JJavaError: An error occurred while calling o98.fit.
: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.mllib.linalg.SparseVector.toArray(Vectors.scala:678)
    at org.apache.spark.mllib.clustering.VectorWithNorm.toDense(KMeans.scala:612)
Does this mean Spark cannot handle even a 200k x 25k dataset with 300+ clusters, or am I missing something?
Upvotes: 1
Views: 873
Reputation: 8995
org.apache.spark.mllib.clustering.VectorWithNorm.toDense(KMeans.scala:612)
That's the problem. The cluster centres are converted to a dense representation and then broadcast to all the executors. This will not scale to thousands of features, which is your case. Check out SparseML.
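A rough back-of-envelope makes the size of that dense representation concrete (assuming 8-byte doubles; the note about the initialization candidate set is approximate):

n_features = 25000
k = 300

bytes_per_dense_vector = n_features * 8              # ~200 KB per densified vector
bytes_for_all_centers = k * bytes_per_dense_vector   # ~60 MB for one dense copy of the centers
print(bytes_per_dense_vector, bytes_for_all_centers) # 200000 60000000

# The k-means|| initialization additionally works with an oversampled
# candidate set larger than k, and dense copies end up on the driver and on
# every executor, so the effective footprint is a multiple of this figure
# and keeps growing with k.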
Upvotes: 1