Reputation: 226
In python, how can I create a column whose values represent groups, where groups are based on ties indicated by the names of persons (col1) and associates (col2)?
The dataset is longitudinal / panel / time-series. There are many missing values in the associate column. But for some persons, there are many values in the associate col. Associates are probably linked through persons, and this is what I hope the group column will clarify. I.e. there are only needed as many groups, so as to group persons that can be linked through other persons somehow.
How do I go about handling this? It seems to involve at least one loop/iteration, which I am not awfully skilled at writing.
index | person | associate | desired column: group |
---|---|---|---|
1 | name1 | - | 1 |
2 | name1 | - | 1 |
3 | name1 | - | 1 |
1 | name2 | name1 | 1 |
2 | name2 | name4 | 1 |
3 | name2 | name3 | 1 |
1 | name3 | - | 1 |
2 | name3 | - | 1 |
3 | name3 | - | 1 |
1 | name4 | - | 1 |
2 | name4 | - | 1 |
3 | name4 | - | 1 |
1 | name5 | - | - |
2 | name5 | - | - |
3 | name5 | - | - |
Upvotes: 1
Views: 39
Reputation: 260430
You need to find the connected_components
in a graph, use networkx
, then map
you groups:
import networkx as nx
G = nx.from_pandas_edgelist(df, source='person', target='associate')
G.remove_node('-')
groups = {n: i if len(g)>1 else '-' for i, g in
enumerate(nx.connected_components(G), start=1)
for n in g}
df['group'] = df['person'].map(groups)
Output:
index person associate group
0 1 name1 - 1
1 2 name1 - 1
2 3 name1 - 1
3 1 name2 name1 1
4 2 name2 name4 1
5 3 name2 name3 1
6 1 name3 - 1
7 2 name3 - 1
8 3 name3 - 1
9 1 name4 - 1
10 2 name4 - 1
11 3 name4 - 1
12 1 name5 - -
13 2 name5 - -
14 3 name5 - -
Your data as a graph:
Upvotes: 2