chippycentra

Reputation: 879

Use pandas to cluster in a dataframe

I need help processing a table with pandas. Here is the table:

Col1    Col2
A   B
C   B
D   B
E   F
G   F
F   A
Z   Y
H   Y
L   P

From this table I would like to build clusters and get a new table such as:

Cluster Names
Cluster1    A
Cluster1    B
Cluster1    C
Cluster1    D
Cluster1    F
Cluster1    E
Cluster1    G
Cluster2    Z
Cluster2    Y
Cluster2    H
Cluster3    L
Cluster3    P

As you can see, the letters A, B, C, D, E, F and G are in Cluster1 because they are all linked to each other, directly or through another letter.

`A` and `B` are in the same line (A and B form `Cluster1`)
`C` and `B` are in the same line (C joins `Cluster1`)
`D` and `B` are in the same line (D joins `Cluster1`)
`F` and `A` are in the same line (F joins `Cluster1`)
`E` and `F` are in the same line (E joins `Cluster1`)
`G` and `F` are in the same line (G joins `Cluster1`)

`Z` and `Y` are in the same line (Z and Y form `Cluster2`)
`H` and `Y` are in the same line (H joins `Cluster2`)

`L` and `P` are in the same line (L and P form `Cluster3`)

Does anyone have an idea how to do this with pandas?

Upvotes: 1

Views: 761

Answers (1)

Dani Mesejo

Reputation: 61910

This is a graph problem known as connected components: treat each row as an edge between two names, and each cluster is one connected component. I suggest you use networkx.connected_components:

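import networkx as nx
import pandas as pd

# df is the DataFrame from the question; each row is an edge between Col1 and Col2
g = nx.from_pandas_edgelist(df, source='Col1', target='Col2', create_using=nx.Graph)

# each connected component is a set of names that belong to one cluster
for component in nx.connected_components(g):
    print(component)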

Output

{'E', 'G', 'C', 'D', 'F', 'A', 'B'}
{'Y', 'H', 'Z'}
{'L', 'P'}
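The snippet above assumes `df` already exists. As a minimal reproducible sketch, the question's table can be rebuilt by hand (the column values below are copied from the question). Sorting is my own addition: components are returned as sets, so the print order of names can differ between runs.

```python
import networkx as nx
import pandas as pd

# Edge list copied from the question; each row links two names.
df = pd.DataFrame({
    "Col1": ["A", "C", "D", "E", "G", "F", "Z", "H", "L"],
    "Col2": ["B", "B", "B", "F", "F", "A", "Y", "Y", "P"],
})

# An undirected graph is the default when create_using is omitted.
g = nx.from_pandas_edgelist(df, source="Col1", target="Col2")

# Sort names inside each component, then sort the components themselves,
# so the result does not depend on set iteration order.
components = sorted(sorted(c) for c in nx.connected_components(g))
print(components)
```

With the sorted version, the clusters come out in a stable order regardless of how the sets happen to iterate.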

Notice that the components match the groups in your expected output. To convert them to a DataFrame, do the following:

data = [
    [f'Cluster{i}', element]
    for i, component in enumerate(nx.connected_components(g), 1)
    for element in component
]

result = pd.DataFrame(data=data, columns=['Cluster', 'Names'])
print(result)

Output

     Cluster Names
0   Cluster1     D
1   Cluster1     A
2   Cluster1     B
3   Cluster1     G
4   Cluster1     C
5   Cluster1     F
6   Cluster1     E
7   Cluster2     Z
8   Cluster2     Y
9   Cluster2     H
10  Cluster3     L
11  Cluster3     P
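If adding networkx as a dependency is not an option, the same grouping can be sketched with a small union-find (disjoint-set) structure in plain Python. This is an illustration of the connected-components idea, not how networkx implements it; the helper names `find` and `cluster_pairs` are hypothetical.

```python
def find(parent, x):
    """Return the representative of x's group, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def cluster_pairs(pairs):
    """Group names that are linked, directly or indirectly, by the pairs."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    # Collect members per representative; dicts keep insertion order,
    # so clusters are numbered by the first appearance of any member.
    groups = {}
    for name in parent:
        groups.setdefault(find(parent, name), []).append(name)

    return [
        (f"Cluster{i}", name)
        for i, members in enumerate(groups.values(), 1)
        for name in members
    ]


# Rows of the question's table as (Col1, Col2) pairs.
pairs = [("A", "B"), ("C", "B"), ("D", "B"), ("E", "F"), ("G", "F"),
         ("F", "A"), ("Z", "Y"), ("H", "Y"), ("L", "P")]
rows = cluster_pairs(pairs)
```

With a real DataFrame, the pairs could be taken from `df[['Col1', 'Col2']].itertuples(index=False, name=None)`, and `rows` can be passed to `pd.DataFrame(rows, columns=['Cluster', 'Names'])` to get the same table shape as above.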

Upvotes: 3
