Reputation: 139
import pandas as pd
data={'x':['A','A','B','B','C','E','F'],
'y':['B','C','A','C','D','F','G']}
df=pd.DataFrame(data)
print(df)
I have a big dataframe like this one (simplified with ABC):
x y
0 A B
1 A C
2 B A
3 B C
4 C D
5 E F
6 F G
There are "loops" like row 0: A <-> B and row 2: B <-> A which mean the same relation for me.
I want to have the relation between the x and y column values and give them a unique new id.
So for this example table this means:
A = B = C = D give this a unique id, i.e. 90 E = F = G give this a unique id, i.e. 91
The Result table i need should be:
id value
0 90 A
1 90 B
2 90 C
3 90 D
4 91 E
5 91 F
6 91 G
How can i achieve this with pandas? Help will be very much appreciated!
Upvotes: 3
Views: 334
Reputation: 153460
This seems like a graph, ie networkx library, problem. Let's look for nodes in connected components within a graph network (see this wiki page).
import pandas as pd
import networkx as nx
data={'x':['A','A','B','B','C','E','F'],
'y':['B','C','A','C','D','F','G']}
df=pd.DataFrame(data)
G = nx.from_pandas_edgelist(df, 'x','y')
g = nx.connected_components(G)
S = pd.Series()
for i,n in enumerate(g):
s = pd.Series(sorted(list(n)), index=[i]*len(n))
S = pd.concat([S, s])
S
Output:
0 A
0 B
0 C
0 D
1 E
1 F
1 G
dtype: object
Upvotes: 6