maxipod
maxipod

Reputation: 89

How to cluster pairs without networkx

I have a list of lists:

[['suzy', 'joe', ...], ['suzy', 'tom', ...], ['paul', 'kristy',...], ['kristy', 'griffin',...], ...]

My desired output is to cluster the pairs into two lists:

['suzy', 'joe', 'tom']
['paul', 'kristy', 'griffin']

I read this post that had a similar question, but it was answered with networkx. However, I am avoiding use of networkx, and have been struggling with doing this.

I tried creating a dictionary with one name, and then appending other names if they were in a pair, but it was getting messy. Any help is appreciated!

Upvotes: 3

Views: 257

Answers (1)

BrokenBenchmark
BrokenBenchmark

Reputation: 19253

The networkx solution you were referring to created an undirected graph, where the nodes were names and the edges were observed pairings. The clusters are then the connected components of the graph, which you can read off using a graph traversal algorithm (e.g. DFS, BFS).

So, you have two options:

  1. Construct your own graph implementation, without using networkx, doing the above. Nothing I've just described are things that you can only do with networkx.

  2. Use the disjoint sets data structure to generate the groupings. As the name suggests, we maintain a collection of disjoint sets (where a set represents a cluster of people). Whenever we see a pair, we union the clusters that the two people in the pairing originally belonged to. An implementation of union-find, as well as its application to the problem you've described, is given below:

# Union-find by size.
def find(data, target):
    if data[target] < 0:
        return target
    data[target] = find(data, data[target])
    return data[target]

def union(data, first, second):
    first_root, second_root = find(data, first), find(data, second)
    if first_root == second_root:
        return False
    elif first_root < second_root:
        data[first_root] += data[second_root]
        data[second_root] = first_root
        return True
    else:
        data[second_root] += data[first_root]
        data[first_root] = second_root
        return True

data = [['suzy', 'joe'], ['suzy', 'tom'], ['paul', 'kristy'], ['kristy', 'griffin']]

# Extract list of names from data.
name_set = set()
for pair in data:
    name_set.add(pair[0])
    name_set.add(pair[1])

name_list = list(name_set)
name_table = {name: index for index, name in enumerate(name_list)}

# Find disjoint sets.
disjoint_sets = [-1] * len(name_table)
for pair in data:
    union(disjoint_sets, name_table[pair[0]], name_table[pair[1]])

# Read off disjoint sets into lists.
result = {}
for name in name_table:
    set_id = find(disjoint_sets, name_table[name])
    if set_id not in result:
        result[set_id] = []
    result[set_id].append(name)

print(list(result.values()))

Upvotes: 1

Related Questions