Reputation: 195
Thank you for reading.
I have a dataframe which looks like this:
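| Col_A | Col_B | Col_C | Col_D | Col_E |
|------:|------:|------:|------:|------:|
| 1 | 2 | null | null | null |
| 1 | null | 3 | null | null |
| null | 2 | 3 | null | null |
| null | 2 | null | 4 | null |
| 1 | null | null | null | 5 |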
Each row consists of a match between two IDs (e.g. ID1 from Col_A matches to ID2 from Col_B on the first row).
In the example above, all 5 IDs are connected (1 is connected to 2, 2 to 3, 2 to 4, 1 to 5). I therefore want to create a new column which clusters all of these rows together so that I can easily access each group of matched pairs:
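| Col_A | Col_B | Col_C | Col_D | Col_E | Group ID |
|------:|------:|------:|------:|------:|---------:|
| 1 | 2 | null | null | null | 1 |
| 1 | null | 3 | null | null | 1 |
| null | 2 | 3 | null | null | 1 |
| null | 2 | null | 4 | null | 1 |
| 1 | null | null | null | 5 | 1 |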
I have not yet been able to find a similar question, but apologies if this is a duplicate. Many thanks in advance for any advice.
Upvotes: 2
Views: 334
Reputation: 153460
As @YOBEN_S and @QuangHoang suggest, you can use the networkx library and the graph-theory notion of connected components, like this.
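For intuition, here is a minimal sketch of what "connected components" means for the five pairs in the question, without pandas. The edge list is typed in by hand from the question's rows, and the name `components` is only illustrative.

```python
import networkx as nx

# Edges copied by hand from the question's five rows (one matched pair per row).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (1, 5)]

G = nx.Graph()
G.add_edges_from(edges)

# Each connected component is a set of IDs that can reach one another.
components = [sorted(c) for c in nx.connected_components(G)]
print(components)
```

All five IDs land in a single component, which is why every row in the question gets the same group ID.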
Given df,
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col_A': {0: 1.0, 1: 1.0, 2: np.nan, 3: np.nan, 4: 1.0, 5: np.nan},
'Col_B': {0: 2.0, 1: np.nan, 2: 2.0, 3: 2.0, 4: np.nan, 5: np.nan},
'Col_C': {0: np.nan, 1: 3.0, 2: 3.0, 3: np.nan, 4: np.nan, 5: np.nan},
'Col_D': {0: np.nan, 1: np.nan, 2: np.nan, 3: 4.0, 4: np.nan, 5: np.nan},
'Col_E': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: 5.0, 5: np.nan},
'Col_F': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: 6.0},
'Col_G': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: 7.0}})
| | Col_A | Col_B | Col_C | Col_D | Col_E | Col_F | Col_G |
|---:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|
| 0 | 1 | 2 | nan | nan | nan | nan | nan |
| 1 | 1 | nan | 3 | nan | nan | nan | nan |
| 2 | nan | 2 | 3 | nan | nan | nan | nan |
| 3 | nan | 2 | nan | 4 | nan | nan | nan |
| 4 | 1 | nan | nan | nan | 5 | nan | nan |
| 5 | nan | nan | nan | nan | nan | 6 | 7 |
Use
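import networkx as nx
# Each row holds two non-null IDs; collect them as one edge per row.
d_edge = df.apply(lambda x: x.dropna().to_numpy(), axis=1)
# Build an undirected graph from the edges and find its connected components.
G = nx.from_edgelist(d_edge.to_numpy().tolist())
cc_list = list(nx.connected_components(G))
# Label each row with the 1-based index of the component containing its first ID.
df['groupid'] = d_edge.apply(lambda x: [n for n, i in enumerate(cc_list) if x[0] in i][0] + 1)
df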
Output:
| | Col_A | Col_B | Col_C | Col_D | Col_E | Col_F | Col_G | groupid |
|---:|--------:|--------:|--------:|--------:|--------:|--------:|--------:|----------:|
| 0 | 1 | 2 | nan | nan | nan | nan | nan | 1 |
| 1 | 1 | nan | 3 | nan | nan | nan | nan | 1 |
| 2 | nan | 2 | 3 | nan | nan | nan | nan | 1 |
| 3 | nan | 2 | nan | 4 | nan | nan | nan | 1 |
| 4 | 1 | nan | nan | nan | 5 | nan | nan | 1 |
| 5 | nan | nan | nan | nan | nan | 6 | 7 | 2 |
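The per-row lookup above searches the component list again for every row. A possible variant, sketched here and not part of the original answer, builds an ID-to-group dictionary once and maps each row's first ID through it. It assumes every row holds exactly two non-null IDs and that equal values in different columns refer to the same ID. The names `edges`, `node_to_group`, and `first_id` are illustrative.

```python
import numpy as np
import pandas as pd
import networkx as nx

nan = np.nan
df = pd.DataFrame({
    'Col_A': [1.0, 1.0, nan, nan, 1.0, nan],
    'Col_B': [2.0, nan, 2.0, 2.0, nan, nan],
    'Col_C': [nan, 3.0, 3.0, nan, nan, nan],
    'Col_D': [nan, nan, nan, 4.0, nan, nan],
    'Col_E': [nan, nan, nan, nan, 5.0, nan],
    'Col_F': [nan, nan, nan, nan, nan, 6.0],
    'Col_G': [nan, nan, nan, nan, nan, 7.0],
})

# One edge per row: the two non-null IDs (assumes exactly two per row).
edges = [tuple(row.dropna()) for _, row in df.iterrows()]

G = nx.Graph()
G.add_edges_from(edges)

# Map every ID to a 1-based component number, computed once.
node_to_group = {
    node: group
    for group, component in enumerate(nx.connected_components(G), start=1)
    for node in component
}

# Both IDs in a row share a component, so the first ID is enough to label it.
first_id = pd.Series([u for u, _ in edges], index=df.index)
df['groupid'] = first_id.map(node_to_group)
print(df['groupid'].tolist())
```

On the sample data this should give the same `groupid` values as the table above; for large frames the single dictionary lookup per row is likely cheaper than rescanning the components.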
Upvotes: 3