Romain

Reputation: 809

IllegalArgumentException: u'requirement failed' on kmeans.fit

Using Spark from a Zeppelin notebook, I've been getting this error since yesterday. Here's my code:

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

df = sqlContext.table("rfmdata_clust")

k = 4

# Set Kmeans input/output columns
vecAssembler = VectorAssembler(inputCols=["v1_clust", "v2_clust", "v3_clust"], outputCol="features")
featuresDf = vecAssembler.transform(df)

# Run KMeans
kmeans = KMeans().setInitMode("k-means||").setK(k)
model = kmeans.fit(featuresDf)
resultDf = model.transform(featuresDf)

# KMeans WSSSE
wssse = model.computeCost(featuresDf)
print("Within Set Sum of Squared Errors = " + str(wssse))

And here's the error:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8890997346928959256.py", line 346, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8890997346928959256.py", line 334, in <module>
    exec(code)
  File "<stdin>", line 8, in <module>
  File "/usr/lib/spark/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: u'requirement failed'

The line that throws the error is the kmeans.fit() one. I checked the rfmdata_clust dataframe and nothing about it looks unusual.

df.printSchema() gives:

root
 |-- id: string (nullable = true)
 |-- v1_clust: double (nullable = true)
 |-- v2_clust: double (nullable = true)
 |-- v3_clust: double (nullable = true)

featuresDf.printSchema() gives:

root
 |-- id: string (nullable = true)
 |-- v1_clust: double (nullable = true)
 |-- v2_clust: double (nullable = true)
 |-- v3_clust: double (nullable = true)
 |-- features: vector (nullable = true)

Another interesting point: adding featuresDf = featuresDf.limit(10000) just below the definition of featuresDf makes the code run without errors. Maybe it is related to the size of the data?

Upvotes: 1

Views: 2863

Answers (1)

Hopefully this has been solved already. If not, please try this:

    df = df.na.fill(1)

This replaces every NaN value with 1 (of course, you can choose any other value). The error happens because you have NaN values in the feature vector, which KMeans cannot handle. That would also explain why limit(10000) works: presumably the first 10,000 rows happen to contain no NaN. You might need to import the necessary packages. This should help too.

Let me know if this fails.

Upvotes: 2
