Aymen Rahal

Reputation: 125

Failed to execute user defined function(VectorAssembler

I am working with K-means as a clustering algorithm. My code won't execute and shows this error:

org.apache.spark.SparkException: Failed to execute user defined function(VectorAssembler$$Lambda$1525/671078904: (struct<latitude:double,longitude:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)

here is the dataframe code:

import org.apache.spark.ml.feature.VectorAssembler

val st = stations
    .withColumn("longitude", $"longitude".cast(sql.types.DoubleType))
    .withColumn("latitude", $"latitude".cast(sql.types.DoubleType))
val stationVA = new VectorAssembler()
    .setInputCols(Array("latitude", "longitude"))
    .setOutputCol("location")
val stationWithLoc = stationVA.transform(st)

println("Assembled columns 'latitude', 'longitude' to vector column 'location'")
stationWithLoc.select("name", "location").show(false)

stationWithLoc.printSchema()
stationWithLoc.show()

printSchema() works fine, but as soon as I call show() I get the error above.

Upvotes: 7

Views: 7057

Answers (2)

caring-goat-913

Reputation: 4049

This question is old, but I just ran into this issue with pyspark.

I believe the error is caused by null values in the data. Calling fillna() on my feature columns before using VectorAssembler resolved the error.
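Applied to the question's Scala code, the equivalent fix would be na.fill (or na.drop) on the two feature columns before assembling. This is only a sketch: the 0.0 fill value is a placeholder, and the column names are taken from the question.

```scala
// Sketch of the fix in Scala: replace nulls in the feature columns
// before assembling. VectorAssembler (with the default
// handleInvalid = "error") throws exactly this SparkException when it
// hits a null. Pick a fill value that makes sense for your data, or
// drop the incomplete rows instead.
import org.apache.spark.ml.feature.VectorAssembler

val cleaned = st.na.fill(0.0, Seq("latitude", "longitude"))
// Alternatively, drop rows with missing coordinates:
// val cleaned = st.na.drop(Seq("latitude", "longitude"))

val stationWithLoc = new VectorAssembler()
  .setInputCols(Array("latitude", "longitude"))
  .setOutputCol("location")
  .transform(cleaned)
```

Since Spark 2.4 you can also call .setHandleInvalid("skip") on the assembler to silently drop such rows instead of cleaning them up front.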

Upvotes: 15

Siddaram H

Reputation: 1176

For me, the issue was with the data: I was using a CSV file that had a newline in the middle of a row. After fixing the file, the error went away. Check with df.head(1) whether all the columns were read correctly.
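If the embedded newlines are legitimate (i.e. they sit inside quoted fields), Spark's CSV reader can parse them with the multiLine option instead of the file being edited by hand. A sketch, where "stations.csv" is a placeholder path:

```scala
// Sketch: read a CSV whose quoted fields may span lines.
// multiLine has been available since Spark 2.2; by default the reader
// treats every physical line as a new record, which splits such rows
// and produces malformed column values.
val stations = spark.read
  .option("header", "true")
  .option("multiLine", "true") // allow quoted fields to contain newlines
  .csv("stations.csv")

stations.head(1) // eyeball the first row: were all columns parsed?
```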

Upvotes: 2