Melsauce

Reputation: 2655

How to create an edge list from pandas dataframe?

I have a pandas dataframe (df) of the form:

    Col1
A  [Green,Red,Purple]
B  [Red, Yellow, Blue]
C  [Brown, Green, Yellow, Blue]

I need to convert this to an edge list i.e. a dataframe of the form:

Source    Target    Weight
  A         B         1
  A         C         1
  B         C         2

EDIT: Note that the new dataframe has one row for each possible pairwise combination of rows in the original. The 'Weight' column is the size of the intersection of the two lists. For instance, B and C share two colors, Blue and Yellow, so the 'Weight' for that row is 2.
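To make the weight rule concrete, here is a minimal sketch for the pair B and C. The variable names are only for illustration; the lists are copied from the example rows above.

```python
# Illustrative only: the color lists are copied from rows B and C above.
b = ['Red', 'Yellow', 'Blue']
c = ['Brown', 'Green', 'Yellow', 'Blue']

shared = set(b) & set(c)   # colors present in both lists
weight = len(shared)       # edge weight for the pair (B, C)

print(sorted(shared), weight)
```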

What is the fastest way to do this? The original dataframe contains about 28,000 elements.

Upvotes: 5

Views: 9405

Answers (3)

cs95

Reputation: 402483

First, starting off with the dataframe:

from itertools import combinations
import pandas as pd

df = pd.DataFrame({
        'Col1': [['Green','Red','Purple'], 
                 ['Red', 'Yellow', 'Blue'], 
                 ['Brown', 'Green', 'Yellow', 'Blue']]
     }, index=['A', 'B', 'C'])

df['Col1'] = df['Col1'].apply(set)    
df

                           Col1
A          {Purple, Red, Green}
B           {Red, Blue, Yellow}
C  {Green, Yellow, Blue, Brown}

Each list in Col1 has been converted into a set so that intersections can be computed efficiently. Next, we'll use itertools.combinations to create pairwise combinations of all rows in df:

df1 = pd.DataFrame(
    data=list(combinations(df.index.tolist(), 2)), 
    columns=['Src', 'Dst'])

df1

  Src Dst
0   A   B
1   A   C
2   B   C

Now, apply a function that takes the intersection of the two sets and finds its length. The Src and Dst columns act as a lookup into df.

df1['Weights'] = df1.apply(lambda x: len(
    df.loc[x['Src'], 'Col1'].intersection(df.loc[x['Dst'], 'Col1'])), axis=1)
df1

  Src Dst  Weights
0   A   B        1
1   A   C        1
2   B   C        2

I advise doing the set conversion once, at the very beginning. Converting your lists to sets on the fly for every pair is expensive and wasteful.

For more speedup, you'd probably also want to copy the sets into two columns of the new dataframe, because calling df.loc once per row adds noticeable overhead.
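One way that last suggestion might look is sketched below. It is an untimed sketch, not a benchmark: SrcSet and DstSet are hypothetical helper column names, and the list comprehension stands in for the row-wise apply.

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({
    'Col1': [['Green', 'Red', 'Purple'],
             ['Red', 'Yellow', 'Blue'],
             ['Brown', 'Green', 'Yellow', 'Blue']]
}, index=['A', 'B', 'C'])

# Convert each list to a set once, up front.
sets = df['Col1'].apply(set)

df1 = pd.DataFrame(list(combinations(df.index, 2)), columns=['Src', 'Dst'])

# Copy the sets next to each pair so no df.loc lookup is needed per row.
# SrcSet and DstSet are hypothetical helper column names.
df1['SrcSet'] = df1['Src'].map(sets)
df1['DstSet'] = df1['Dst'].map(sets)

# Weight = size of the intersection of the two sets in each row.
df1['Weights'] = [len(a & b) for a, b in zip(df1['SrcSet'], df1['DstSet'])]

# Drop the helper columns once the weights are computed.
df1 = df1.drop(columns=['SrcSet', 'DstSet'])
```

On the example data this gives the same Weights column as above (1, 1, 2). Whether it is actually faster on 28,000 rows would need to be measured.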

Upvotes: 7

piRSquared

Reputation: 294258

  • get an array of sets
  • get pairwise indices representing all combinations using np.triu_indices
  • use & operator to get the pairwise intersections and get lengths via a comprehension

import numpy as np
import pandas as pd

c = df.Col1.apply(set).values

i, j = np.triu_indices(c.size, 1)

pd.DataFrame(dict(
        Source=df.index[i],
        Target=df.index[j],
        Weight=[len(s) for s in c[i] & c[j]]
    ))

  Source Target  Weight
0      A      B       1
1      A      C       1
2      B      C       2
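As a quick check of the index step, the sketch below compares the pairs from np.triu_indices with those from itertools.combinations for the three-row example. The variable names n, pairs, and same are only for illustration.

```python
from itertools import combinations
import numpy as np

n = 3  # number of rows in the example dataframe

# k=1 skips the diagonal, so every pair satisfies i < j.
i, j = np.triu_indices(n, 1)

pairs = list(zip(i.tolist(), j.tolist()))
print(pairs)  # [(0, 1), (0, 2), (1, 2)]

# Same pairs, in the same order, as itertools.combinations.
same = pairs == list(combinations(range(n), 2))
print(same)  # True
```

For n rows this yields n * (n - 1) / 2 pairs. With the roughly 28,000 rows mentioned in the question, that is 391,986,000 pairs, so the index arrays and the resulting dataframe will be large.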

Upvotes: 2

BENY

Reputation: 323226

Try this. It is not very neat, but it works. PS: You can adjust the final output yourself; I did not drop the helper columns or rename the columns.

import pandas as pd
from itertools import combinations

df = pd.DataFrame({"Col1": [['Green', 'Red', 'Purple'], ['Red', 'Yellow', 'Blue'], ['Brown', 'Green', 'Yellow', 'Blue']], "two": ['A', 'B', 'C']})
df = df.set_index('two')
# Clear the index name (`del df.index.name` may raise on newer pandas versions)
df.index.name = None
# One row per pair of index labels
DF = pd.DataFrame(data=list(combinations(df.index, 2)))
# Look up the list of colors for each label in the pair
DF['0_0'] = DF[0].map(df['Col1'])
DF['1_1'] = DF[1].map(df['Col1'])
# Weight = number of colors the two lists share
DF['Weight'] = DF.apply(lambda x: len(set(x['0_0']).intersection(x['1_1'])), axis=1)



DF
   0  1                   0_0                           1_1  Weight
0  A  B  [Green, Red, Purple]           [Red, Yellow, Blue]       1
1  A  C  [Green, Red, Purple]  [Brown, Green, Yellow, Blue]       1
2  B  C   [Red, Yellow, Blue]  [Brown, Green, Yellow, Blue]       2

Upvotes: 5
