JQPx

Reputation: 69

Predict new data based on previously clustered set

I have a large set of binary data that I need to cluster. For example

[[0 1 1 0 ... 0 1 0 1 ],
 [1 0 1 1 ... 0 0 1 1 ],
 ...
 [0 0 1 0 ... 1 0 1 1 ]]

From what I've read, the best clustering algorithms for binary data are hierarchical ones, such as agglomerative clustering. So I implemented that with scikit-learn.

I have a very large data set with new data coming in all the time which I would like to cluster into a previously clustered group. So my thinking was to take a random sample of the existing data, run the AgglomerativeClustering on it and save the results to a file using joblib.

Then when a new set of data arrives, load the previous clustering and call predict() to figure out where it would fall. It's almost like I'm training a cluster similar to a classifier, but without the labels. The problem is that AgglomerativeClustering doesn't have a predict() method. Other clustering algorithms in scikit-learn do have predict(), such as KMeans, but based on my research that's not a good algorithm to use with binary data.

So I'm stuck. I don't want to have to re-run the clustering every time new data arrives, because hierarchical algorithms don't scale well with a lot of data, but I'm not sure which algorithm would work with binary data and also provide a predict() method.

Is there a way I can transform the binary data so that other algorithms, like KMeans, produce useful output? Or is there a completely different algorithm, not implemented in scikit-learn, that would work? I'm not tied to scikit-learn, so switching is not an issue.

Upvotes: 0

Views: 1405

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77485

When you want to predict, use a classifier, not clustering.

Here, the most appropriate classifier would likely be a 1-nearest-neighbor (1NN) classifier: each new point is assigned to the cluster of its nearest already-clustered point. For performance reasons, though, I'd choose a decision tree or an SVM instead.
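The 1NN approach above can be sketched as follows; the saved sample and labels are stand-ins for the output of the earlier clustering step, and Hamming distance is an assumed (natural) choice for 0/1 vectors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Stand-ins for the previously clustered sample and its cluster labels
sample = rng.integers(0, 2, size=(200, 20))
labels = rng.integers(0, 5, size=200)

# 1NN: a new point inherits the cluster of its nearest clustered point.
clf = KNeighborsClassifier(n_neighbors=1, metric="hamming")
clf.fit(sample, labels)

# When a new batch arrives, "predict" its clusters
new_data = rng.integers(0, 2, size=(10, 20))
pred = clf.predict(new_data)
```

Swapping `KNeighborsClassifier` for `DecisionTreeClassifier` or `SVC` (trained on the same sample/labels pair) gives the faster alternatives mentioned above, at the cost of an extra training step.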

Upvotes: 0

Related Questions