Reputation: 809
Using Spark from a Zeppelin notebook, I have been getting this error since yesterday. Here's my code:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
df = sqlContext.table("rfmdata_clust")
k = 4
# Set Kmeans input/output columns
vecAssembler = VectorAssembler(inputCols=["v1_clust", "v2_clust", "v3_clust"], outputCol="features")
featuresDf = vecAssembler.transform(df)
# Run KMeans
kmeans = KMeans().setInitMode("k-means||").setK(k)
model = kmeans.fit(featuresDf)
resultDf = model.transform(featuresDf)
# KMeans WSSSE
wssse = model.computeCost(featuresDf)
print("Within Set Sum of Squared Errors = " + str(wssse))
And here's the error:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8890997346928959256.py", line 346, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8890997346928959256.py", line 334, in <module>
    exec(code)
  File "<stdin>", line 8, in <module>
  File "/usr/lib/spark/python/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 236, in _fit
    java_model = self._fit_java(dataset)
  File "/usr/lib/spark/python/pyspark/ml/wrapper.py", line 233, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: u'requirement failed'
The line that throws the error is the kmeans.fit() call. I checked the rfmdata_clust dataframe and nothing about it looks unusual.
df.printSchema()
gives:
root
|-- id: string (nullable = true)
|-- v1_clust: double (nullable = true)
|-- v2_clust: double (nullable = true)
|-- v3_clust: double (nullable = true)
featuresDf.printSchema()
gives:
root
|-- id: string (nullable = true)
|-- v1_clust: double (nullable = true)
|-- v2_clust: double (nullable = true)
|-- v3_clust: double (nullable = true)
|-- features: vector (nullable = true)
Another interesting point: adding featuresDf = featuresDf.limit(10000) below the definition of featuresDf makes the code run without errors. Maybe it is related to the size of the data?
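One way to check whether the limit() just happens to skip problematic rows would be to count the null/NaN values per feature column (a quick sketch, assuming the column names from the schema above):
from pyspark.sql import functions as F
# Count rows where each feature column is null or NaN
df.select([
    F.count(F.when(F.col(c).isNull() | F.isnan(c), c)).alias(c)
    for c in ["v1_clust", "v2_clust", "v3_clust"]
]).show()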
Upvotes: 1
Views: 2863
Reputation: 36
Hopefully this has been solved already. If not, please try this:
df = df.na.fill(1)
This replaces every null/NaN value in the numeric columns with 1; of course you could choose any other value. The error occurs because you have NaN values in the feature vector. You may also need to import the necessary packages.
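For example, to restrict the fill to the columns fed to the VectorAssembler (a sketch using the column names from the question), or to drop the affected rows instead of imputing a value:
# Fill null/NaN only in the feature columns
df = df.na.fill(1, subset=["v1_clust", "v2_clust", "v3_clust"])
# Alternative: drop rows containing null/NaN in those columns
# df = df.na.drop(subset=["v1_clust", "v2_clust", "v3_clust"])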
Let me know if this fails.
Upvotes: 2