Reputation: 2507
I have a DF,
name Id
0 ABC i1
1 ABC i2
2 PQR i3
3 XYZ i1
4 XYZ i2
I want to find, name
s that shares the same at least two Id
s. I want to build a graph with weighted edges and weight is dependent on if two or more Id
s share more than one name.
So, in this example there will be i1
and i2
having ABC
and XYZ
common.
I tried to achieve this by multiindexing int Pandas, where I created a new DF like following,
Id name
i1 ABC
ABC
i3 PQR
i2 XYZ
XYZ
I am completely lost from here on. Thank you!
Expected output would be:
Id1 Id2 count
0 i1 i2 2
1 i3 Null Null
count
will be the number of times there are Id
s i1
and i2
having number of name
s common.
EDITED DF
name Id
0 ABC i1
1 ABC i2
2 PQR i3
3 XYZ i1
4 QWE i2
##########
Input DF, used for experimenting (dups_df
)
name Id
0 ABC i1
1 ABC i2
2 QWE i1
3 PQR i3
4 XYZ i2
RESULTS
Of this DF,
dups_df['UniqueID'] = dups_df['Id'].map(node2id)
name Id UniqueId
0 ABC i1 1
1 ABC i2 1
2 QWE i1 1
3 PQR i3 2
4 XYZ i2 1
dups_df=dups_df.groupby('UniqueID').agg({'name':'nunique', 'Id':'unique'})
UniqueId Name Id
1 3 [i1,i2]
2 1 i3
df1 = pd.DataFrame(dups_df.pop('investors_uuids').tolist()).rename(columns= lambda x: "investors_uuids{}".format(x+1))
df1['count'] = dups_df['organization_name'].mask(dups_df['organization_name'].lt(2)).to_numpy()
**outputs**
Id1 Id2 count
0 i1 i2 3
1 i3 None NaN
The last output DF is wrong, the count at index 0 should be NaN or 1 since in the new DF defined abouve called dups_df
there is no other name
common that is shared by i1
and i2
except ABC
so the count should be 1 or NAN not 3!
My Problem I have Ids and names columns, I want to build a network graph where I connect two Ids with an edge and the weight of the edge is defined if the two Ids(nodes) share more than one name.
Upvotes: 1
Views: 61
Reputation: 863611
Use networkx
with connected_components
:
import networkx as nx
# Create the graph from the dataframe
g = nx.Graph()
g.add_edges_from(df[['name','Id']].itertuples(index=False))
connected_components = nx.connected_components(g)
#create dict for common values
node2id = {}
for cid, component in enumerate(connected_components):
for node in component:
node2id[node] = cid + 1
df['UniqueID'] = df['Id'].map(node2id)
print (df)
0 ABC i1 1
1 ABC i2 1
2 PQR i3 2
3 XYZ i1 1
4 XYZ i2 1
Then aggregate DataFrameGroupBy.nunique
and SeriesGroupBy.nunique
ad create new DataFrame:
df = df.groupby('UniqueID').agg({'name':'nunique', 'Id':'unique'})
print (df)
name Id
UniqueID
1 2 [i1, i2]
2 1 [i3]
df1 = pd.DataFrame(df.pop('Id').tolist()).rename(columns = lambda x: f'Id{x+1}')
df1['count'] = df['name'].mask(df['name'].lt(2)).to_numpy()
print (df1)
Id1 Id2 count
0 i1 i2 2.0
1 i3 None NaN
Upvotes: 4