Rishabh Sahrawat

Reputation: 2507

How to find whether two rows of a column exist based on another column in a DataFrame?

I have a DF,

   name  Id
0  ABC   i1
1  ABC   i2
2  PQR   i3
3  XYZ   i1
4  XYZ   i2

I want to find names that share at least two of the same Ids. I want to build a graph with weighted edges, where the weight depends on whether two or more Ids share more than one name.

So, in this example, i1 and i2 have ABC and XYZ in common.

I tried to achieve this with multi-indexing in Pandas, where I created a new DF like the following,

Id   name
i1   ABC
     ABC
i3   PQR
i2   XYZ
     XYZ

I am completely lost from here on. Thank you!

Expected output would be:

   Id1 Id2   count
0  i1  i2    2
1  i3  Null  Null

count will be the number of names that Ids i1 and i2 have in common.
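One way to get this shared-name count per Id pair would be a plain self-merge on name; a minimal sketch with the sample data above (note that Ids without a partner, like i3, simply do not appear in the result):

import pandas as pd

df = pd.DataFrame({'name': ['ABC', 'ABC', 'PQR', 'XYZ', 'XYZ'],
                   'Id':   ['i1', 'i2', 'i3', 'i1', 'i2']})

# self-merge on name gives every pair of Ids that share a name
pairs = df.merge(df, on='name')
pairs = pairs[pairs['Id_x'] < pairs['Id_y']]          # keep each unordered pair once

counts = (pairs.groupby(['Id_x', 'Id_y'])['name']
               .nunique()
               .reset_index(name='count'))
print(counts)
#   Id_x Id_y  count
# 0   i1   i2      2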

EDITED DF

   name  Id
0  ABC   i1
1  ABC   i2
2  PQR   i3
3  XYZ   i1
4  QWE   i2

########## Input DF, used for experimenting (dups_df)

   name  Id
0  ABC   i1
1  ABC   i2
2  QWE   i1
3  PQR   i3
4  XYZ   i2

Results of this DF after dups_df['UniqueID'] = dups_df['Id'].map(node2id):

   name  Id  UniqueID
0  ABC   i1         1
1  ABC   i2         1
2  QWE   i1         1
3  PQR   i3         2
4  XYZ   i2         1

dups_df = dups_df.groupby('UniqueID').agg({'name':'nunique', 'Id':'unique'})

          name        Id
UniqueID
1            3  [i1, i2]
2            1      [i3]

df1 = pd.DataFrame(dups_df.pop('investors_uuids').tolist()).rename(columns=lambda x: "investors_uuids{}".format(x+1))
df1['count'] = dups_df['organization_name'].mask(dups_df['organization_name'].lt(2)).to_numpy()

Outputs:
  Id1  Id2  count
0 i1   i2   3
1 i3   None NaN

The last output DF is wrong: the count at index 0 should be NaN or 1, since in the new DF defined above (dups_df) there is no name other than ABC shared by both i1 and i2, so the count should be 1 or NaN, not 3!
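Indeed, counting the names shared by i1 and i2 on the original dups_df (before it is overwritten by the groupby) confirms this; a quick check with a set intersection:

shared = dups_df.groupby('Id')['name'].agg(set)
print(len(shared['i1'] & shared['i2']))   # 1 -> only 'ABC' is shared by both Ids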

My problem: I have Ids and names columns, and I want to build a network graph where I connect two Ids with an edge, and the weight of the edge depends on whether the two Ids (nodes) share more than one name.
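A minimal sketch of building that weighted Id graph directly with networkx, using the first sample data above (edge weight = number of names two Ids share; edges with weight below 2 could then be dropped if only Ids sharing more than one name should stay connected):

import pandas as pd
import networkx as nx
from itertools import combinations

df = pd.DataFrame({'name': ['ABC', 'ABC', 'PQR', 'XYZ', 'XYZ'],
                   'Id':   ['i1', 'i2', 'i3', 'i1', 'i2']})

g = nx.Graph()
g.add_nodes_from(df['Id'].unique())

# for each name, connect every pair of Ids carrying it and accumulate the weight
for _, ids in df.groupby('name')['Id']:
    for a, b in combinations(sorted(ids.unique()), 2):
        current = g.edges[a, b]['weight'] if g.has_edge(a, b) else 0
        g.add_edge(a, b, weight=current + 1)

print(g.edges(data=True))   # [('i1', 'i2', {'weight': 2})]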

Upvotes: 1

Views: 61

Answers (1)

jezrael

Reputation: 863611

Use networkx with connected_components:

import networkx as nx

# Create the graph from the dataframe: names and Ids become nodes,
# each row adds an edge between a name and an Id
g = nx.Graph()
g.add_edges_from(df[['name','Id']].itertuples(index=False))

connected_components = nx.connected_components(g)

# map every node of a connected component to the same id
node2id = {}
for cid, component in enumerate(connected_components):
    for node in component:
        node2id[node] = cid + 1

df['UniqueID'] = df['Id'].map(node2id)
print (df)

  name  Id  UniqueID
0  ABC  i1         1
1  ABC  i2         1
2  PQR  i3         2
3  XYZ  i1         1
4  XYZ  i2         1
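For reference, the node2id dict built above maps every node of a connected component, names and Ids alike, to the same component id; with this sample data it would look roughly like:

print(node2id)
# roughly {'ABC': 1, 'XYZ': 1, 'i1': 1, 'i2': 1, 'PQR': 2, 'i3': 2}
# (ordering may differ; only the grouping matters)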

Then aggregate with GroupBy.agg, using nunique for name and unique for Id, and create a new DataFrame:

df = df.groupby('UniqueID').agg({'name':'nunique', 'Id':'unique'})
print (df)
          name        Id
UniqueID                
1            2  [i1, i2]
2            1      [i3]

df1 = pd.DataFrame(df.pop('Id').tolist()).rename(columns = lambda x: f'Id{x+1}')
df1['count'] = df['name'].mask(df['name'].lt(2)).to_numpy()
print (df1)
  Id1   Id2  count
0  i1    i2    2.0
1  i3  None    NaN
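If the end goal is the weighted Id graph, this frame can be turned into weighted edges, for example (a sketch continuing from the snippets above; the row without a partner Id is dropped first):

import networkx as nx

edges = df1.dropna(subset=['Id2'])

g2 = nx.Graph()
g2.add_weighted_edges_from(edges[['Id1', 'Id2', 'count']].itertuples(index=False))
print(g2.edges(data=True))   # [('i1', 'i2', {'weight': 2.0})]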

Upvotes: 4
