Salma Elzeheiry

Reputation: 37

How to apply k-means to a Parquet file?

I want to apply k-means to the data in my Parquet file, but the following error appears:

java.lang.ArrayIndexOutOfBoundsException: 2

Code:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
val parsedData = Data.rdd.map(s => Vectors.dense(s.getDouble(1),s.getDouble(2))).cache()

val numClusters = 30
val numIteration = 1
val userClusterModel = KMeans.train(parsedData, numClusters, numIteration)
val userfeature1 = parsedData.first
val userCost = userClusterModel.computeCost(parsedData)
println("WSSSE for users: " + userCost)

How can I solve this error?

Upvotes: 0

Views: 191

Answers (2)

Salma Elzeheiry

Reputation: 37

    val parsedData = Data.rdd.map(s => Vectors.dense(s.getInt(0),s.getDouble(1))).cache()
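
Some context on why this change helps, as a hedged sketch rather than a definitive diagnosis: Row fields are indexed from zero, so a row with two columns only has fields 0 and 1, and s.getDouble(2) reads past the end of the row, which matches ArrayIndexOutOfBoundsException: 2. The switch to getInt(0) suggests the first column is stored as an integer; calling getDouble on an integer column would fail with a ClassCastException. The snippet below assumes a spark-shell session where sqlContext is predefined, and it prints the schema first so the column count and types can be checked before indexing.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Assumption: running in spark-shell, where sqlContext already exists.
val Data = sqlContext.read.parquet("/usr/local/spark/dataset/norm")

// Show column names and types; the number of columns is the upper bound
// (exclusive) for any index passed to getInt / getDouble.
Data.printSchema()
println("columns: " + Data.columns.length)

// Valid indices for a two-column row are 0 and 1 only.
// Assumption: column 0 is an integer column and column 1 is a double column.
val parsedData = Data.rdd.map { s =>
  Vectors.dense(s.getInt(0).toDouble, s.getDouble(1))
}.cache()
```

If printSchema reports different types, the getter for each column has to be changed to match.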

Upvotes: 0

Jayadeep Jayaraman

Reputation: 2825

I believe you are using https://spark.apache.org/docs/latest/mllib-clustering.html#k-means as a reference to build your K-Means model.

In that example:

val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

Here, data is of type org.apache.spark.rdd.RDD. In your case, sqlContext.read.parquet returns a DataFrame, so you have to convert the DataFrame to an RDD before you can perform the split operation.

To convert a DataFrame to an RDD, you can use the sample below as a reference:

val rows: RDD[Row] = df.rdd
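
Putting the conversion together with the clustering step might look like the sketch below. It is an illustration under stated assumptions, not a tested solution for this dataset: it assumes a spark-shell session with sqlContext predefined, and that every column in the Parquet file is numeric and non-null. Because the rows are Row objects rather than text lines, each field is read from the row directly instead of being split. The names df, vectors, and model are chosen here for illustration.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Assumption: running in spark-shell, where sqlContext already exists.
val df = sqlContext.read.parquet("/usr/local/spark/dataset/norm")
val rows: RDD[Row] = df.rdd

// Build one dense vector per row from all of its fields, so no column
// index is hard-coded.
// Assumption: every field is a non-null numeric value (Int, Long, Double, ...).
val vectors: RDD[Vector] = rows.map { row =>
  Vectors.dense(row.toSeq.map(_.asInstanceOf[Number].doubleValue()).toArray)
}.cache()

// Same parameters as in the question: 30 clusters, 1 iteration.
val model = KMeans.train(vectors, 30, 1)
println("WSSSE: " + model.computeCost(vectors))
```

Reading every field through row.toSeq avoids out-of-range indices, at the cost of failing with a ClassCastException or NullPointerException if a column is non-numeric or contains nulls.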

Upvotes: 1
