Reputation: 867
I've came across the following problem:
Imagine we have a set of n
samples that we want to classify into k
classes labeled 1-k
. We run M
different clustering algorithms and get M
different outputs. The catch is that the same clusters in different outputs can be given a different label in each output.
How to find the common clusters between all outputs? The obious solution I think is to run over all possible pairs of samples, checking whether they are classified the same in each ouput. That gives complexity of O(n^2*M)
.
Can we do better (maybe if we add some assumptions)?
Thanks.
EDIT
I'll give an example. We have 4 samples, k=2 and got the following outputs:
A 1 1 2
B 1 1 2
C 2 2 1
D 1 1 1
Than the only common cluster is (A,B) since it s the only pair that is classified the same in all outputs.
Upvotes: 3
Views: 428
Reputation: 77454
Analyze the clusters, not the data points, for example by computing the contingency table.
This will easily get you down to O(M*k*k*n)
plus O(n log n)
for sorting your cluster contents once, if you don't have them presorted already (for efficient intersection).
I don't think it pays off to analyze k-means results though. On reasonably complex data, they are only as good as random convex partitionings.
Upvotes: 0
Reputation: 6246
from what i get is that you need to check whether any two outputs are actually structurally similar but you can only think of O(n^2) algorithm to do that. If your problem is the above one then an optimization is as follows :-
Psuedo Code :-
int arr1 = [1 1 2 2];
int arr2 = [2 2 1 1];
list sets1[k];
list sets2[k];
for(int i=0;i<n;i++) {
sets1[arr1[i]-1].add(i);
sets2[arr2[i]-1].add(i);
}
boolean flag = true;
for(int i=0;i<k;i++) {
flag = flag && compare(sets1[arr1[i]-1],sets2[arr2[i]-1]);
if(flag == false)
return flag
}
return flag
Time Complexity :-
The compare function visits all elements in arr1 & arr2 atmost once hence it is O(n)
overall.
Edit :-
Further if you need to evaluate whether all such outputs which are similar in less than O(M^2*n)
then :-
1. calculate sets for all M
2. Calculate hash for each set using standard hash functions.
3. if two set are equal then their hashes are also equal with high probability
4. Sort k hash for each output in O(logk)
5. Get all equivalent set using hash map in O(M*logM)
Overall Complexity :- O(n*M)
for sets calculation and O(M*logM)
for getting similar outputs hence O(M*(n+logM))
Upvotes: 2
Reputation: 41474
Min-Hash can be used to efficiently estimate the similarity between two clusters. Its time is linear in the number of elements, so it'd bring your runtime down to O(n*k^2*j) (where j is the number of hash functions used by Min-Hash, and higher values give more accurate results).
Upvotes: 0
Reputation: 19601
For each sample, think of the outputs of the M clustering algorithms as the M characters of a string. Now you have n strings of length M and you need to find duplicates. One practical way of doing this to to compute a hash code for each string - in fact you could build a table mapping the hash code to a list of strings with that hash code. Strings with different hash codes must be different. If you have a collection of strings with the same hash code, start by comparing each of them with the first string with that hash code. If they are all the same you have confirmed that the hash code did not produce misleading collisions quickly. If they are not all the same, you have a sub-collection that are the same as the first string, and another sub-collection where you will have to repeat the comparisons.
If the hash code does not produce misleading collisions you can divide the strings into clusters in linear time. If it does, you may take quadratic time as above.
A linear time solution which may be impractical is to concatenate the strings, separating them by a character not seen so far, and then run a linear-time suffix tree or suffix array creation program. This will sort the strings into tree or array order, and you can then find divisions between clusters by going through the strings in order comparing each string with the next string in order.
Upvotes: 0