Grouping Data into Clusters Based on DataFrame Columns

Question

I have a DataFrame (df) with columns A and B. I would like to add a column C that identifies a cluster of related rows based on the values in columns A and B, like this:

A    B    C   
1    2    a
1    3    a
1    4    a
2    5    a
3    1    a
3    2    a
4    6    a
4    7    a
8    9    b
9    8    b

Note that since 1 (in A) is related to 2 (in B), and 2 (in A) is related to 5 (in B), these rows are all placed in the same cluster. 8 (in A) is only related to 9 (in B), so those rows are placed in another cluster.

To sum up, how do I define clusters based upon pairwise connections where pairs are defined by two columns in a DataFrame?

DSM · Accepted Answer

You can view this as a set consolidation problem (with each row describing a set) or a connected component problem (with each row describing an edge between two nodes). As far as I know, pandas has no native support for this, although I've considered submitting a PR adding it to the utility tools.

Anyway, you could do something like:

def consolidate(sets):
    # http://rosettacode.org/wiki/Set_consolidation#Python:_Iterative
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

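def group_ids(pairs):
    # Each (A, B) pair becomes a two-element set; overlapping sets are merged.
    groups = consolidate(map(set, pairs))
    d = {}
    # Sets only compare by subset, so sort on the smallest member to get
    # a deterministic numbering of the groups.
    for i, group in enumerate(sorted(groups, key=min)):
        for elem in group:
            d[elem] = i
    return d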

after which we have

>>> df["C"] = df["A"].replace(group_ids(zip(df.A, df.B)))
>>> df
   A  B  C
0  1  2  0
1  1  3  0
2  1  4  0
3  2  5  0
4  3  1  0
5  3  2  0
6  4  6  0
7  4  7  0
8  8  9  1
9  9  8  1

and you can replace the 0s and 1s with whatever labels you want.
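
For the connected component view mentioned at the top, here is a rough sketch using a small union-find (disjoint-set) structure in plain Python. This is a different technique from the set consolidation code in this answer, not a rewrite of it. The name `component_ids` is made up for illustration, and the pairs are simply the (A, B) rows from the question.

```python
def component_ids(pairs):
    """Map every node to the index of its connected component (union-find)."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root,
        # halving the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    # Collect the members of each component, keyed by root.
    members = {}
    for node in list(parent):
        members.setdefault(find(node), []).append(node)

    # Number components by their smallest member so labels are deterministic.
    ordered = sorted(members.values(), key=min)
    return {node: i for i, group in enumerate(ordered) for node in group}


# The (A, B) rows from the question, written out as edges.
pairs = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 1), (3, 2),
         (4, 6), (4, 7), (8, 9), (9, 8)]
labels = component_ids(pairs)
print([labels[a] for a, _ in pairs])  # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

The returned dict has the same shape as the one from `group_ids`, so it should be usable in the same way on column A. Union-find avoids rescanning the remaining sets for every row, which may matter on larger frames.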
