Reputation: 126
I want to cluster a streaming dataset using Spark. I first tried KMeans, but calling its fit method throws a runtime exception saying it cannot be used with streaming data:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
Then I tried StreamingKMeans, but it seems this model works only with Spark's legacy streaming API and accepts only a DStream. Does anyone know a workaround, or another solution to this problem?
The code I've written so far is as follows:
Dataset<Row> df = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(functions.from_json(new Column("value"), schema).as("data"));
// Input and output columns for the assembler are not shown here.
VectorAssembler assembler = new VectorAssembler();
df = assembler.transform(df);
StreamingKMeans kmeans = new StreamingKMeans().setK(3).setDecayFactor(1.0);
// Does not compile: predictOn expects a DStream, not a Dataset<Row>.
StreamingKMeansModel model = kmeans.predictOn(df);
The call to predictOn is rejected with:
Cannot resolve method 'predictOn(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>)'
Upvotes: 1
Views: 59
Reputation: 126
In the end I found that this is not possible, so I switched from Structured Streaming to DStream.
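For anyone wondering what the DStream-based StreamingKMeans does with each micro-batch, and what setDecayFactor controls, here is a rough plain-Java sketch of the per-batch update rule as I understand it from the MLlib guide: c_new = (c * n * a + sum of batch points assigned to c) / (n * a + m), with the weight growing as n_new = n + m. This is not Spark's code. The class name StreamingKMeansSketch and the methods nearest and update are made up for illustration, and the real implementation may differ in details such as how weights are discounted or how dying clusters are handled.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Spark.
public class StreamingKMeansSketch {

    /** Index of the center closest to the point (squared Euclidean distance). */
    static int nearest(double[][] centers, double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < centers.length; j++) {
            double d = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = centers[j][i] - point[i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    /**
     * Applies one batch update in place.
     * centers[j] is the current center c, weights[j] its weight n,
     * decay is the decay factor a (1.0 keeps all history, 0.0 keeps only the batch).
     */
    static void update(double[][] centers, double[] weights, double[][] batch, double decay) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] sums = new double[k][dim]; // per-cluster sum of batch points
        int[] counts = new int[k];            // per-cluster batch count m

        // Assign every batch point to its nearest current center.
        for (double[] p : batch) {
            int j = nearest(centers, p);
            counts[j]++;
            for (int i = 0; i < dim; i++) {
                sums[j][i] += p[i];
            }
        }

        // Blend the old center (discounted by decay) with the batch points.
        for (int j = 0; j < k; j++) {
            double discounted = weights[j] * decay;
            double total = discounted + counts[j];
            if (total == 0.0) {
                continue; // no history and no new points: leave the center alone
            }
            for (int i = 0; i < dim; i++) {
                centers[j][i] = (centers[j][i] * discounted + sums[j][i]) / total;
            }
            weights[j] = weights[j] + counts[j];
        }
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[] weights = {2.0, 2.0};
        double[][] batch = {{2.0, 2.0}, {4.0, 0.0}, {12.0, 10.0}};
        update(centers, weights, batch, 1.0);
        System.out.println(Arrays.deepToString(centers));
        System.out.println(Arrays.toString(weights));
    }
}
```

With a decay factor of 1.0, as in the question, every past batch counts as much as the newest one. With 0.0, each center simply jumps to the mean of the points assigned to it in the latest batch.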
Upvotes: 1