Reputation: 1718
I have a DataFrame (df) that resembles the following:
A B
1 2
1 3
1 4
2 5
4 6
4 7
8 9
9 8
I would like to add a column that essentially determines a related cluster based upon the values in columns A and B:
A B C
1 2 a
1 3 a
1 4 a
2 5 a
3 1 a
3 2 a
4 6 a
4 7 a
8 9 b
9 8 b
Note that since 1 (in A) is related to 2 (in B), and 2 (in A) is related to 5 (in B), these rows are all placed in the same cluster. 8 (in A) is related only to 9 (in B), and vice versa, so those rows are placed in another cluster.
To sum up, how do I define clusters based upon pairwise connections where pairs are defined by two columns in a DataFrame?
Upvotes: 1
Views: 3473
Reputation: 353059
You can view this as a set consolidation problem (with each row describing a set) or a connected component problem (with each row describing an edge between two nodes). As far as I know, pandas has no native support for this, although I've considered submitting a PR adding it to the utility tools.
Anyway, you could do something like:
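def consolidate(sets):
    # http://rosettacode.org/wiki/Set_consolidation#Python:_Iterative
    # Merge every pair of overlapping sets until only disjoint sets remain.
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    # Fold s1 into s2, empty s1, and keep going with the merged set.
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

def group_ids(pairs):
    # Map each element to the index of the consolidated group containing it.
    groups = consolidate(map(set, pairs))
    d = {}
    # Sets only compare by subset, so order the groups by their smallest element.
    for i, group in enumerate(sorted(groups, key=min)):
        for elem in group:
            d[elem] = i
    return d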
after which we have
>>> df["C"] = df["A"].replace(group_ids(zip(df.A, df.B)))
>>> df
A B C
0 1 2 0
1 1 3 0
2 1 4 0
3 2 5 0
4 4 6 0
5 4 7 0
6 8 9 1
7 9 8 1
and you can replace the 0s and 1s by whatever you want.
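The connected component view mentioned above can also be sketched with a small union-find structure, which avoids repeatedly intersecting sets. This is only an illustrative alternative, not part of the code above: the function name `cluster_labels` and the numbering of clusters by first appearance are assumptions made for this example, and it uses plain Python with no pandas dependency.

```python
def cluster_labels(pairs):
    """Map every node appearing in `pairs` to an integer cluster id (union-find)."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each pair is an edge: merge the two components it connects.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the clusters in order of first appearance.
    ids = {}
    labels = {}
    for node in parent:
        labels[node] = ids.setdefault(find(node), len(ids))
    return labels


# The (A, B) pairs from the question's DataFrame.
pairs = [(1, 2), (1, 3), (1, 4), (2, 5), (4, 6), (4, 7), (8, 9), (9, 8)]
labels = cluster_labels(pairs)
```

With the DataFrame from the question, something like df["C"] = df["A"].map(cluster_labels(zip(df.A, df.B))) should then attach the ids, and the integers can be mapped to letters or any other labels afterwards.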
Upvotes: 4
Reputation: 751
Here is a start. I'm not sure I understood the criteria for grouping into clusters, but you should be able to substitute the exact criteria:
import pandas as pd
x = pd.DataFrame({'A': [1,1,1,2,4,4,8,9],
                  'B': [2,3,4,5,6,7,9,8]})
## calculate the absolute difference between the A and B columns
## (substitute any distance/association function)
x['Diff'] = abs(x['A'] - x['B'])
## flag whether each row counts as being in a cluster (difference of at most 1)
x['Incluster'] = x['Diff'] <= 1
Upvotes: 0