JQPx

Reputation: 69

Predict new data based on previously clustered set

I have a large set of binary data that I need to cluster. For example

[[0 1 1 0 ... 0 1 0 1 ],
 [1 0 1 1 ... 0 0 1 1 ],
 ...
 [0 0 1 0 ... 1 0 1 1 ]]

From what I've read, the best clustering algorithms for binary data are hierarchical ones, such as agglomerative clustering. So I implemented that with scikit-learn.

I have a very large data set with new data coming in all the time which I would like to cluster into a previously clustered group. So my thinking was to take a random sample of the existing data, run the AgglomerativeClustering on it and save the results to a file using joblib.

Then when a new set of data arrives, load the previous clustering and call predict() to figure out where it would fall. It's almost like I'm training a cluster similar to a classifier, but without the labels. The problem is that AgglomerativeClustering doesn't have a predict() method. Other clustering algorithms in scikit-learn do have predict(), such as KMeans, but based on my research that's not a good algorithm to use with binary data.

So I'm stuck. I don't want to have to re-run the clustering every time new data arrives, because hierarchical algorithms don't scale well with a lot of data, but I'm not sure which algorithm would work with binary data and also provide a predict() method.

Is there a way I can transform the binary data so that other algorithms, like KMeans, produce useful output? Or is there a completely different algorithm, not implemented in scikit-learn, that would work? I'm not tied to scikit-learn, so switching is not an issue.

Upvotes: 0

Views: 1405

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77485

When you want to predict, use a classifier, not clustering.

Here, the most appropriate classifier would likely be a 1-nearest-neighbor (1NN) classifier: each new point is assigned to the cluster of its nearest already-clustered point. For performance reasons, though, I'd choose a decision tree or an SVM instead.
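The 1NN approach above can be sketched as follows; the saved sample and labels are stand-ins for the output of the earlier clustering step, and Hamming distance is an assumed (natural) choice for 0/1 vectors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Stand-ins for the previously clustered sample and its cluster labels
sample = rng.integers(0, 2, size=(200, 20))
labels = rng.integers(0, 5, size=200)

# 1NN: a new point inherits the cluster of its nearest clustered point.
clf = KNeighborsClassifier(n_neighbors=1, metric="hamming")
clf.fit(sample, labels)

# When a new batch arrives, "predict" its clusters
new_data = rng.integers(0, 2, size=(10, 20))
pred = clf.predict(new_data)
```

Swapping `KNeighborsClassifier` for `DecisionTreeClassifier` or `SVC` (trained on the same sample/labels pair) gives the faster alternatives mentioned above, at the cost of an extra training step.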

Upvotes: 0

Related Questions