DrTRD

Reputation: 1718

Grouping Data into Clusters Based on DataFrame Columns

I have a DataFrame (df) that resembles the following:

A    B   
1    2
1    3
1    4
2    5
4    6
4    7
8    9
9    8

I would like to add a column that essentially determines a related cluster based upon the values in columns A and B:

A    B    C   
1    2    a
1    3    a
1    4    a
2    5    a
3    1    a
3    2    a
4    6    a
4    7    a
8    9    b
9    8    b

Note that since 1 (in A) is related to 2 (in B), and 2 (in A) is related to 5 (in B), these are all placed in the same cluster. 8 (in A) is related only to 9 (in B), so these two are placed in a separate cluster.

To sum up, how do I define clusters based upon pairwise connections where pairs are defined by two columns in a DataFrame?

Upvotes: 1

Views: 3473

Answers (2)

DSM

Reputation: 353059

You can view this as a set consolidation problem (with each row describing a set) or a connected component problem (with each row describing an edge between two nodes). As far as I know, pandas has no native support for this, although I've considered submitting a PR adding it to the utility tools.

Anyway, you could do something like:

def consolidate(sets):
    # http://rosettacode.org/wiki/Set_consolidation#Python:_Iterative
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

def group_ids(pairs):
    groups = consolidate(map(set, pairs))
    d = {}
    # Sets only compare by subset, so order the groups by smallest member instead.
    for i, group in enumerate(sorted(groups, key=min)):
        for elem in group:
            d[elem] = i
    return d

after which we have

>>> df["C"] = df["A"].map(group_ids(zip(df.A, df.B)))
>>> df
   A  B  C
0  1  2  0
1  1  3  0
2  1  4  0
3  2  5  0
4  4  6  0
5  4  7  0
6  8  9  1
7  9  8  1

and you can replace the 0s and 1s by whatever you want.
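If you prefer the connected-component view, here is a minimal union-find sketch that also produces letter labels directly. This is only an illustration, not something pandas provides: `cluster_labels` is a hypothetical helper name, it assumes the values in A and B can be sorted against each other, and the letter labelling only covers up to 26 clusters.

```python
from string import ascii_lowercase


def cluster_labels(pairs):
    """Map every value appearing in `pairs` to a cluster letter (hypothetical helper)."""
    parent = {}

    def find(x):
        # Walk up to the root of x's tree, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        # Each row is an edge: merge the components containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the components in order of their smallest member, then turn
    # the numbers into letters (assumption: at most 26 clusters).
    roots = {}
    for x in sorted(parent):
        roots.setdefault(find(x), len(roots))
    return {x: ascii_lowercase[roots[find(x)]] for x in parent}


pairs = [(1, 2), (1, 3), (1, 4), (2, 5), (4, 6), (4, 7), (8, 9), (9, 8)]
labels = cluster_labels(pairs)
```

With the DataFrame from the question, something like `df["C"] = df["A"].map(cluster_labels(zip(df.A, df.B)))` should then give the letter column.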

Upvotes: 4

ajerneck

Reputation: 751

Here is a start. I'm not sure I understood the criteria for grouping into clusters, but you should be able to add the exact criteria:

import pandas as pd

x = pd.DataFrame({'A': [1, 1, 1, 2, 4, 4, 8, 9],
                  'B': [2, 3, 4, 5, 6, 7, 9, 8]})

## calculate the absolute difference between columns A and B
## (substitute any distance/association function)
x['Diff'] = abs(x['A'] - x['B'])

## assign whether row is in a cluster or not.
x['Incluster'] = x['Diff'] <= 1

Upvotes: 0
