steff9488
steff9488

Reputation: 139

How to get unique Relation between 2 columns after Self Join Pandas?

import pandas as pd

data={'x':['A','A','B','B','C','E','F'],
      'y':['B','C','A','C','D','F','G']}
df=pd.DataFrame(data)

print(df)

I have a big dataframe like this one (simplified with ABC):

     x    y
0    A    B
1    A    C
2    B    A
3    B    C
4    C    D
5    E    F
6    F    G

There are "loops" like row 0: A <-> B and row 2: B <-> A which mean the same relation for me.

I want to have the relation between the x and y column values and give them a unique new id.

So for this example table this means:

A = B = C = D give this a unique id, i.e. 90 E = F = G give this a unique id, i.e. 91

The Result table i need should be:

    id  value
0   90    A
1   90    B
2   90    C 
3   90    D
4   91    E
5   91    F
6   91    G

How can i achieve this with pandas? Help will be very much appreciated!

Upvotes: 3

Views: 334

Answers (1)

Scott Boston
Scott Boston

Reputation: 153460

This seems like a graph, ie networkx library, problem. Let's look for nodes in connected components within a graph network (see this wiki page).

import pandas as pd
import networkx as nx

data={'x':['A','A','B','B','C','E','F'],
      'y':['B','C','A','C','D','F','G']}
df=pd.DataFrame(data)
G = nx.from_pandas_edgelist(df, 'x','y')
g = nx.connected_components(G)
S = pd.Series()
for i,n in enumerate(g):
    s = pd.Series(sorted(list(n)), index=[i]*len(n))
    S = pd.concat([S, s])

S

Output:

0    A
0    B
0    C
0    D
1    E
1    F
1    G
dtype: object

Upvotes: 6

Related Questions