Reputation: 89
I have a list of lists:
[['suzy', 'joe', ...], ['suzy', 'tom', ...], ['paul', 'kristy',...], ['kristy', 'griffin',...], ...]
My desired output is to cluster the pairs into two lists:
['suzy', 'joe', 'tom']
['paul', 'kristy', 'griffin']
I read this post that had a similar question, but it was answered with networkx
. However, I am avoiding use of networkx
, and have been struggling with doing this.
I tried creating a dictionary with one name, and then appending other names if they were in a pair, but it was getting messy. Any help is appreciated!
Upvotes: 3
Views: 257
Reputation: 19253
The networkx
solution you were referring to created an undirected graph, where the nodes were names and the edges were observed pairings. The clusters are then the connected components of the graph, which you can read off using a graph traversal algorithm (e.g. DFS, BFS).
So, you have two options:
Construct your own graph implementation, without using networkx
, doing the above. Nothing I've just described are things that you can only do with networkx
.
Use the disjoint sets data structure to generate the groupings. As the name suggests, we maintain a collection of disjoint sets (where a set represents a cluster of people). Whenever we see a pair, we union the clusters that the two people in the pairing originally belonged to. An implementation of union-find, as well as its application to the problem you've described, is given below:
# Union-find by size.
def find(data, target):
if data[target] < 0:
return target
data[target] = find(data, data[target])
return data[target]
def union(data, first, second):
first_root, second_root = find(data, first), find(data, second)
if first_root == second_root:
return False
elif first_root < second_root:
data[first_root] += data[second_root]
data[second_root] = first_root
return True
else:
data[second_root] += data[first_root]
data[first_root] = second_root
return True
data = [['suzy', 'joe'], ['suzy', 'tom'], ['paul', 'kristy'], ['kristy', 'griffin']]
# Extract list of names from data.
name_set = set()
for pair in data:
name_set.add(pair[0])
name_set.add(pair[1])
name_list = list(name_set)
name_table = {name: index for index, name in enumerate(name_list)}
# Find disjoint sets.
disjoint_sets = [-1] * len(name_table)
for pair in data:
union(disjoint_sets, name_table[pair[0]], name_table[pair[1]])
# Read off disjoint sets into lists.
result = {}
for name in name_table:
set_id = find(disjoint_sets, name_table[name])
if set_id not in result:
result[set_id] = []
result[set_id].append(name)
print(list(result.values()))
Upvotes: 1