natec
natec

Reputation: 53

Pythonically create adjacency matrix from spreadsheet

I have a spreadsheet with lists of names of people that a particular person reported working with on a number of projects. If I import it to pandas as a dataframe it will look like this:

       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']

I want to create an adjacency matrix that will look like this:

      Jane  Fred  Terry  Bob  Alex  Joe
Jane  0     2     0      1    0     2
Fred  1     0     0      0    1     0
Terry 0     0     0      1    0     0
Bob   1     0     1      0    0     1
Alex  0     1     0      0    0     0
Joe   2     0     0      0    0     0

This matrix, generally, will NOT be symmetric because of inconsistency with people's reports. I have been creating the adjacency matrix just by looping through the dataframe and incrementing the the matrix elements accordingly. Apparently, looping through dataframes is NOT recommended and inefficient, so does anyone have a suggestion on how his could be done more pythonically?

Upvotes: 1

Views: 402

Answers (2)

Ibrahim Sherif
Ibrahim Sherif

Reputation: 546

This is the sample of the data I used to work with.

df = pd.DataFrame({
    'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
    '1':[['Fred', 'Joe'], ['Alex'], np.nan,['Joe'], ['Fred'], ['Jane']],
    '2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']]
})

df.head()
    Name            1                 2
0   Jane  [Fred, Joe]  [Joe, Fred, Bob]
1   Fred       [Alex]            [Jane]
2  Terry          NaN             [Bob]
3    Bob        [Joe]     [Jane, Terry]
4   Alex       [Fred]               NaN

I created the adjacency matrix using pandas in three simple steps.

First, I melted the data to have one column only for all the connections between the different names and dropped the variable column.

dff = df.melt(id_vars=['Name']).drop('variable', axis=1)
     Name             value
0    Jane       [Fred, Joe]
1    Fred            [Alex]
2   Terry               NaN
3     Bob             [Joe]
4    Alex            [Fred]
5     Joe            [Jane]
6    Jane  [Joe, Fred, Bob]
7    Fred            [Jane]
8   Terry             [Bob]
9     Bob     [Jane, Terry]
10   Alex               NaN
11    Joe            [Jane]

Secondly, I used the explode method to break down the rows with lists in separate rows.

dff = dff.explode('value')
     Name  value
0    Jane   Fred
0    Jane    Joe
1    Fred   Alex
2   Terry    NaN
3     Bob    Joe
4    Alex   Fred
5     Joe   Jane
6    Jane    Joe
6    Jane   Fred
6    Jane    Bob
7    Fred   Jane
8   Terry    Bob
9     Bob   Jane
9     Bob  Terry
10   Alex    NaN
11    Joe   Jane

Finally, to create the adjacency matrix I used crosstab within pandas which counts the occurrences in the two columns specified only.

pd.crosstab(dff['Name'], dff['value'])
value  Alex  Bob  Fred  Jane  Joe  Terry
Name                                    
Alex      0    0     1     0    0      0
Bob       0    0     0     1    1      1
Fred      1    0     0     1    0      0
Jane      0    1     2     0    2      0
Joe       0    0     0     2    0      0
Terry     0    1     0     0    0      0

Upvotes: 3

RJ Adriaansen
RJ Adriaansen

Reputation: 9619

Here is one approach:

import pandas as pd
import ast

data = '''       1                  2
Jane   ['Fred', 'Joe']    ['Joe', 'Fred', 'Bob']
Fred   ['Alex']           ['Jane']
Terry  NaN                ['Bob']
Bob    ['Joe']            ['Jane', 'Terry']
Alex   ['Fred']           NaN
Joe    ['Jane']           ['Jane']'''

df = pd.read_csv(io.StringIO(data), sep='\s\s+', engine='python').fillna('[]').applymap(ast.literal_eval) #if your columns are already lists rather than string representations, use .fillna([]) and skip the applymap
df['all'] = df['1']+df['2'] #merge lists of columns 1 and 2

df_edges = df[['all']].explode('all').reset_index() #create new df by exploding the combined list
df_edges = df_edges.groupby(['index', 'all'])['all'].count().reset_index(name="count") #groupby and count the pairs

df_edges.pivot(index='index', columns='all', values='count').fillna(0) #create adjacency matrix with pivot

Output:

index Alex Bob Fred Jane Joe Terry
Alex 0 0 1 0 0 0
Bob 0 0 0 1 1 1
Fred 1 0 0 1 0 0
Jane 0 1 2 0 2 0
Joe 0 0 0 2 0 0
Terry 0 1 0 0 0 0

Upvotes: 1

Related Questions