Feenster
Feenster

Reputation: 81

Combine a list of pairs (tuples)?

From a list of linked pairs, I want to combine the pairs into groups of common IDs, so that I can then write group_ids back to the database, eg:

UPDATE table SET group = n WHERE id IN (...........);

Example:

[(1,2), (3, 4), (1, 5), (6, 3), (7, 8)]

becomes

[[1, 2, 5], [3, 4, 6], [7, 8]]

Which allows:

UPDATE table SET group = 1 WHERE id IN (1, 2, 5);
UPDATE table SET group = 2 WHERE id IN (3, 4, 6);
UPDATE table SET group = 3 WHERE id IN (7, 8);

and

[(1,2), (3, 4), (1, 5), (6, 3), (7, 8), (5, 3)]

becomes

[[1, 2, 5, 3, 4, 6], [7, 8]]

Which allows:

UPDATE table SET group = 1 WHERE id IN (1, 2, 5, 3, 4, 6);
UPDATE table SET group = 2 WHERE id IN (7, 8);

I have written some code which works. I pass in a list of tuples, where each tuple is a pair of linked ids. I return a list of lists, where each internal list is a list of common id's.

I iterate over the list of tuples and assign each tuple element to groups as follows:

I am expecting millions of linked pairs, and I am expecting hundreds of thousand of gropus and hunderds of thousands of group members. So, I need the algorithm to be fast, I am looking for some suggestions for some really efficient code. Although I have implemented this to build a list of lists, I am open to anything, they key thing is being able to build the above SQL to get the group ids back to the database.

def group_pairs(list_of_pairs):
    """

    :param list_of_pairs:
    :return:
    """
    groups = list()
    for pair in list_of_pairs:
        a_group = None
        b_group = None

        for group in groups:
            # find what group if any a and b belong to

            # don't bother checking if a group already found
            if a_group is None and pair[0] in group:
                a_group = group
            # don't bother checking if b group already found
            if b_group is None and pair[1] in group:
                b_group = group
            # if a and b found, stop looking
            if a_group is not None and b_group is not None:
                break

        if a_group is None:
            if b_group is None:
                # neither a nor b are in a group; create a new group and
                # add a and b
                groups.append([pair[0], pair[1]])
            else:
                # b is in a group but a isn't, so add a to the b group
                b_group.append(pair[0])
        elif a_group != b_group:
            if b_group is None:
                # a is in a group but b isn't, so add b to the a group
                a_group.append(pair[1])
            else:
                # a and b are in different groups, add join b to a and get
                # rid of b
                a_group.extend(b_group)
                groups.remove(b_group)
        elif a_group == b_group:
            # a and b already in same group, so nothing to do
            pass

    return groups

Upvotes: 2

Views: 2182

Answers (1)

wildwilhelm
wildwilhelm

Reputation: 5019

With:

def make_equiv_classes(pairs):
    groups = {}
    for (x, y) in pairs:
        xset = groups.get(x, set([x]))
        yset = groups.get(y, set([y]))
        jset = xset | yset
        for z in jset:
            groups[z] = jset
    return set(map(tuple, groups.values()))

you can get:

>>> make_equiv_classes([(1,2), (3, 4), (1, 5), (6, 3), (7, 8)])
{(1, 2, 5), (3, 4, 6), (8, 7)}

>>> make_equiv_classes([(1,2), (3, 4), (1, 5), (6, 3), (7, 8), (5, 3)])
{(1, 2, 3, 4, 5, 6), (8, 7)}

The complexity should be O(np), where n is the number of distinct integer values, and p is the number of pairs.

I think set is the proper type for a single group, because it makes the union operation fast and easy to express, and dict is the right way to store groups because you get constant-time lookup for asking the question of what group a particular integer value belongs to.

We can set up a test harness to time this code, if we want to. To start, we can build a random graph over something moderately large, like 10K nodes (i.e., distinct integer values). I'll put in 5K random links (pairs), since that tends to give me thousands of groups, which together account for about two thirds of the nodes (that is, about 3K nodes will be in a singleton groups, not linked to anything else).

import random
pairs = []
while len(pairs) < 5000:
    a = random.randint(1,10000)
    b = random.randint(1,10000)
    if a != b:
        pairs.append((a,b))

Then, we can just time this (I'm using IPython magic here):

In [48]: %timeit c = make_equiv_classes(pairs)
10 loops, best of 3: 63 ms per loop

which is faster than your initial solution:

In [49]: %timeit c = group_pairs(pairs)
1 loop, best of 3: 2.08 s per loop

We can also use this random graph to check that the output of the two functions is identical for some large random input:

>>> c = make_equiv_classes(pairs)
>>> c2 = group_pairs(pairs)
>>> set(tuple(sorted(x)) for x in c) == set(tuple(sorted(x)) for x in c2)
True

Upvotes: 6

Related Questions