Reputation: 1257
I am using Spark's k-means clustering from the ML module, and I am programming in PySpark. It works well up to 200 clusters, but it throws an OutOfMemoryError once I go past 300 clusters. My data contains 200k objects with 25k features per object. I am following the guidelines given for the class pyspark.ml.clustering.KMeans in the pyspark ML documentation. The only difference between the code in that documentation and mine is that I am using sparse vectors instead of dense ones.
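For reference, here is a minimal sketch of the setup described above (the session builder, column name, and toy rows are illustrative, not my actual code; the real DataFrame has 200k rows of 25k-dimensional sparse vectors):

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("sparse-kmeans-sketch").getOrCreate()

# Each row carries a 25k-dimensional SparseVector; only a few indices are
# populated here to keep the sketch small.
data = [(Vectors.sparse(25000, [0, 17, 9423], [1.0, 3.0, 2.0]),),
        (Vectors.sparse(25000, [5, 17, 12001], [2.0, 1.0, 4.0]),)]
df = spark.createDataFrame(data, ["features"])

kmeans = KMeans(k=300, seed=1, featuresCol="features")
model = kmeans.fit(df)  # the fit() call is where the OutOfMemoryError is raised for k >= 300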
There is no hardware limitation, since I have a reasonably large cluster with over 700 cores and 2 TB of memory. I searched for this problem and most of the links led me to one or all of the following configuration changes. This is the set of things that I tried:
conf.set("spark.driver.memory", "64g")
conf.set("spark.default.parallelism","1000")
conf.set("spark.storage.memoryFraction", "1")
In addition to the above configuration, I set the executor memory to 16g and the number of cores to 150. Unfortunately nothing has worked and I keep getting the following error (truncated):
Py4JJavaError: An error occurred while calling o98.fit.
: java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.mllib.linalg.SparseVector.toArray(Vectors.scala:678)
    at org.apache.spark.mllib.clustering.VectorWithNorm.toDense(KMeans.scala:612)
Does this mean Spark cannot handle even a 200k x 25k dataset with 300+ clusters, or am I missing something?
Upvotes: 1
Views: 873
Reputation: 8995
org.apache.spark.mllib.clustering.VectorWithNorm.toDense(KMeans.scala:612)
That's the problem. The cluster centres are converted to a dense representation and then broadcast to all the executors. This will not scale to thousands of features, which is your case. Check out SparseML.
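A rough back-of-envelope makes the size of that dense representation concrete (assuming 8-byte doubles; the note about the initialization candidate set is approximate):

n_features = 25000
k = 300

bytes_per_dense_vector = n_features * 8              # ~200 KB per densified vector
bytes_for_all_centers = k * bytes_per_dense_vector   # ~60 MB for one dense copy of the centers
print(bytes_per_dense_vector, bytes_for_all_centers) # 200000 60000000

# The k-means|| initialization additionally works with an oversampled
# candidate set larger than k, and dense copies end up on the driver and on
# every executor, so the effective footprint is a multiple of this figure
# and keeps growing with k.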
Upvotes: 1