How to format data for the spark mlib kmeans clustering algorithm?

Question

I'm trying to do a kmeans clustering algorithm from apache Spark's mlib library. I have everything setup but I'm not exactly sure how would I go about formatting the input data. I'm relatively new to machine learning so any help would be appreciated. In the sample data.txt the data is as follows: 0.0 0.0 0.0 0.1 0.1 0.1 0.2 0.2 0.2 9.0 9.0 9.0 9.1 9.1 9.1 9.2 9.2 9.2

And the data that I want to run the algorithm on is in this format for now (json array):

[{"customer":"ddf6022","order_id":"20031-19958","asset_id":"dd1~33","price":300,"time":1411134115000,"location":"bt2"},{"customer":"ddf6023","order_id":"23899-23825","asset_id":"dd1~33","price":300,"time":1411954672000,"location":"bt2"}]

How can I convert it into something that can be used with the k-means clustering algorithm? I'm using Java, also I'm guessing I need it to be in a JavaRDD format, but have no idea how to go about doing it.

How to format data for the spark mlib kmeans clustering algorithm?

Answers (1)

Related Questions