Reputation: 1222
I'm trying to run the k-means clustering algorithm from Apache Spark's MLlib library. I have everything set up, but I'm not sure how I should format the input data. I'm relatively new to machine learning, so any help would be appreciated.
In the sample data.txt the data is as follows:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
And the data that I want to run the algorithm on is in this format for now (json array):
[{"customer":"ddf6022","order_id":"20031-19958","asset_id":"dd1~33","price":300,"time":1411134115000,"location":"bt2"},{"customer":"ddf6023","order_id":"23899-23825","asset_id":"dd1~33","price":300,"time":1411954672000,"location":"bt2"}]
How can I convert it into something that can be used with the k-means clustering algorithm? I'm using Java, and I'm guessing the data needs to end up in a JavaRDD, but I have no idea how to go about that.
Upvotes: 6
Views: 1936
Reputation: 1586
How this works:
First of all, you have to decide which dimensions you want to apply KMeans to. The KMeans example in the Spark documentation runs on a data set of 3D points (X, Y and Z dimensions). Take into account that the KMeans implementation in MLlib works on data of n dimensions, where n >= 1.
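As an illustration of that point format, here is a minimal sketch that turns one space-separated line, such as a line of the sample data.txt, into an array of doubles with one entry per dimension. The class and method names (PointParser, parseLine) are made up for this example, and it assumes every line is non-empty and holds only numbers.

```java
final class PointParser {
    // Splits a line such as "9.1 9.1 9.1" on whitespace and parses each
    // token as a double. The number of tokens is the number of dimensions.
    // Assumption: the line is non-empty and contains only numeric tokens;
    // anything else makes Double.parseDouble throw NumberFormatException.
    static double[] parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }
}
```

In a Spark job, an array like this can be wrapped with Vectors.dense(double[]) from org.apache.spark.mllib.linalg to get the vector type that MLlib's KMeans works on.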
A Proposal:
So let's say that, for your input, the X, Y and Z dimensions are going to be the JSON fields price, time and location. Then all you have to do is extract those fields from your data set and put them in a text file as follows:
300 1411134115000 2
300 1411954672000 2
...
...
...
Here location "bt2" has been replaced by 2 (assuming that your data set contains other locations as well). You have to provide numeric values to KMeans.
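A rough sketch of that extraction step in plain Java, assuming the JSON has already been parsed into simple objects. Order and OrderFeatures are hypothetical names for this example, not part of Spark. The integer codes for locations are handed out in order of first appearance starting at 0, which is just one possible scheme, so "bt2" would not necessarily map to 2 as in the lines above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the three fields chosen as dimensions.
// Assumption: the JSON array was parsed into these objects beforehand.
final class Order {
    final double price;
    final long time;
    final String location;

    Order(double price, long time, String location) {
        this.price = price;
        this.time = time;
        this.location = location;
    }
}

final class OrderFeatures {
    // Gives each distinct location string an integer code (0, 1, 2, ...)
    // in order of first appearance.
    static Map<String, Integer> locationCodes(List<Order> orders) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        for (Order o : orders) {
            if (!codes.containsKey(o.location)) {
                codes.put(o.location, codes.size());
            }
        }
        return codes;
    }

    // Builds one numeric row {price, time, locationCode} per order.
    static List<double[]> toRows(List<Order> orders) {
        Map<String, Integer> codes = locationCodes(orders);
        List<double[]> rows = new ArrayList<>();
        for (Order o : orders) {
            double code = codes.get(o.location).doubleValue();
            rows.add(new double[] {o.price, (double) o.time, code});
        }
        return rows;
    }
}
```

Each double[] row can then be passed to Vectors.dense(double[]) from org.apache.spark.mllib.linalg, and the resulting list handed to JavaSparkContext.parallelize to get an RDD of vectors, instead of going through a text file.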
Notes/Ideas:
For better clustering results, and depending on how your data is distributed over time, you could take advantage of the timestamp field by splitting it into separate values: year, month, day, hour, minute, second, etc. That way you can experiment with different dimensions as separate fields, depending on the purpose of your clustering.
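One possible way to split the timestamp, sketched with java.time (Java 8+). It assumes the time field is in epoch milliseconds and should be interpreted in UTC. TimeFeatures is a made-up name for this example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

final class TimeFeatures {
    // Converts an epoch-millisecond timestamp into six numeric fields:
    // {year, month (1-12), day of month, hour, minute, second}.
    // Assumption: UTC is the right time zone for the data.
    static double[] fromEpochMillis(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return new double[] {
            t.getYear(),
            t.getMonthValue(),
            t.getDayOfMonth(),
            t.getHour(),
            t.getMinute(),
            t.getSecond()
        };
    }
}
```

You can then keep only the fields that matter for your clustering purpose, for example just the hour, and use them in place of the raw timestamp.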
Also, I guess you would like to automate the JSON-to-CSV conversion. In your mapping implementation you could use an approach like this: https://stackoverflow.com/a/15411074/833336
Upvotes: 3