Reputation: 7587
I have a dictionary with this format
items = {"0":'lorem ipsum', "1":'blah blah', "2":'blah", "3":'lorem',.....}
Let's say that len(items) = 70
I have a 70x70 matrix with values for each pair of the dictionary. For instance, the matrix[4][20]
representing the value between item 4 and 20 from the dictionary.
I want to cluster the dictionary items and put in the same group all items with a value more than a certain threshold. I don't know if my way is the best way. Maybe there is a library that can do the same thing much quicker than mine. Since my matrix could be 10000x10000.
size = range(0,len(items))
t = 1
mydict = {}
for i in size:
mylist = items[i]
size.remove(i)
for j in size:
if items[j] in lis:
pass
elif matrix[i][j] > 0.50:
mylist.append(items[j])
size.remove(j)
mydict[t] = mylist
t = t + 1
mylist = []
And then I print all the groups with more than 5 items.
for i in mydict:
if len(mydict[i]) > 5:
print mydict[i]
print ""
Do you see any error? Do know any library or a better algorithm to do this?
UPDATED Here is a better example:
items = {"0": 'item 0', "1":'item 1', "2":'item 2', "3":'item 3', "4":'item 4', "5":'item 5'}
matrix = [[1.0 0.2 0.7 0.8 0.3 0.1],
[0.2 1.0 0.2 0.3 0.6 0.2],
[0.7 0.2 1.0 0.7 0.2 0.3],
[0.8 0.3 0.7 1.0 0.2 0.4],
[0.3 0.6 0.2 0.2 1.0 0.1],
[0.1 0.2 0.3 0.4 0.1 1.0]]
**The results:**
[item 0, item 2, item 3]
[item 1, item 4]
[item 5]
The matrix is triangular and all the points in the diagonial are 1.
Upvotes: 0
Views: 252
Reputation: 77505
This sounds like you are trying to do hierarchical linkage clustering with single-link, and then make a cut with the given threshold.
This is essentially equivalent to the minimum spanning tree (MST) problem, and that is also about as fast as you can get: O(n^2)
where n
is the number of items.
So I believe your algorithm is reasonable. Note that the SciPy implementation of hierarchical clustering only supports Ward linkage, not single-linkage, and probably has O(n^3)
complexity.
If you want to make things quicker, try to not build the matrix the first place. But keep a set of clusters, and join these appropriately by adding one element at a time. Essentially, for each new item, find all items that are above your threshold. Then you can get three cases: 1. there is no other cluster nearby. Then start a new cluster with the current item. 2. there is exactly one cluster, add the current item to this cluster. 3. there is more than one cluster. Merge all of these, and then add the current item.
class cluster:
def __init__(self, first):
self.members = [first]
mydict = {}
for i in range(0,len(items)):
nearby = set()
for j in range(0, i):
if matrix[i][j] > 0.5: nearby.add(mydict[j])
if len(nearby) == 0:
mydict[i] = cluster(i)
else:
mydict[i] = nearby.pop()
mydict[i].members.append(i)
for other in nearby: # merge other clusters
mydict[i].members.extend(other.members)
for k in other.members: mydict[k] = mydict[i]
Note that I am deliberately keeping track only of integers. Saving memory is key to handling large data sets, and integers are just the most efficient entity you have.
Upvotes: 2