generate group col using names in two cols (reflecting ties)?

Question

In python, how can I create a column whose values represent groups, where groups are based on ties indicated by the names of persons (col1) and associates (col2)?

The dataset is longitudinal / panel / time-series. There are many missing values in the associate column. But for some persons, there are many values in the associate col. Associates are probably linked through persons, and this is what I hope the group column will clarify. I.e. there are only needed as many groups, so as to group persons that can be linked through other persons somehow.

How do I go about handling this? It seems to involve at least one loop/iteration, which I am not awfully skilled at writing.

index	person	associate	desired column: group
1	name1	-	1
2	name1	-	1
3	name1	-	1
1	name2	name1	1
2	name2	name4	1
3	name2	name3	1
1	name3	-	1
2	name3	-	1
3	name3	-	1
1	name4	-	1
2	name4	-	1
3	name4	-	1
1	name5	-	-
2	name5	-	-
3	name5	-	-

mozway · Accepted Answer

You need to find the connected_components in a graph, use networkx, then map you groups:

import networkx as nx

G = nx.from_pandas_edgelist(df, source='person', target='associate')
G.remove_node('-')

groups = {n: i if len(g)>1 else '-' for i, g in
          enumerate(nx.connected_components(G), start=1)
          for n in g}

df['group'] = df['person'].map(groups)

Output:

    index person associate group
0       1  name1         -     1
1       2  name1         -     1
2       3  name1         -     1
3       1  name2     name1     1
4       2  name2     name4     1
5       3  name2     name3     1
6       1  name3         -     1
7       2  name3         -     1
8       3  name3         -     1
9       1  name4         -     1
10      2  name4         -     1
11      3  name4         -     1
12      1  name5         -     -
13      2  name5         -     -
14      3  name5         -     -

Your data as a graph:

generate group col using names in two cols (reflecting ties)?

Answers (1)

Related Questions