Reputation: 53
I have a spreadsheet with lists of names of people that a particular person reported working with on a number of projects. If I import it to pandas as a dataframe it will look like this:
1 2
Jane ['Fred', 'Joe'] ['Joe', 'Fred', 'Bob']
Fred ['Alex'] ['Jane']
Terry NaN ['Bob']
Bob ['Joe'] ['Jane', 'Terry']
Alex ['Fred'] NaN
Joe ['Jane'] ['Jane']
I want to create an adjacency matrix that will look like this:
Jane Fred Terry Bob Alex Joe
Jane 0 2 0 1 0 2
Fred 1 0 0 0 1 0
Terry 0 0 0 1 0 0
Bob 1 0 1 0 0 1
Alex 0 1 0 0 0 0
Joe 2 0 0 0 0 0
This matrix, generally, will NOT be symmetric because of inconsistency with people's reports. I have been creating the adjacency matrix just by looping through the dataframe and incrementing the the matrix elements accordingly. Apparently, looping through dataframes is NOT recommended and inefficient, so does anyone have a suggestion on how his could be done more pythonically?
Upvotes: 1
Views: 402
Reputation: 546
This is the sample of the data I used to work with.
df = pd.DataFrame({
'Name': ['Jane', 'Fred', 'Terry', 'Bob', 'Alex', 'Joe'],
'1':[['Fred', 'Joe'], ['Alex'], np.nan,['Joe'], ['Fred'], ['Jane']],
'2': [['Joe', 'Fred', 'Bob'], ['Jane'], ['Bob'], ['Jane', 'Terry'], np.nan, ['Jane']]
})
df.head()
Name 1 2
0 Jane [Fred, Joe] [Joe, Fred, Bob]
1 Fred [Alex] [Jane]
2 Terry NaN [Bob]
3 Bob [Joe] [Jane, Terry]
4 Alex [Fred] NaN
I created the adjacency matrix using pandas in three simple steps.
First, I melted the data to have one column only for all the connections between the different names and dropped the variable column.
dff = df.melt(id_vars=['Name']).drop('variable', axis=1)
Name value
0 Jane [Fred, Joe]
1 Fred [Alex]
2 Terry NaN
3 Bob [Joe]
4 Alex [Fred]
5 Joe [Jane]
6 Jane [Joe, Fred, Bob]
7 Fred [Jane]
8 Terry [Bob]
9 Bob [Jane, Terry]
10 Alex NaN
11 Joe [Jane]
Secondly, I used the explode method to break down the rows with lists in separate rows.
dff = dff.explode('value')
Name value
0 Jane Fred
0 Jane Joe
1 Fred Alex
2 Terry NaN
3 Bob Joe
4 Alex Fred
5 Joe Jane
6 Jane Joe
6 Jane Fred
6 Jane Bob
7 Fred Jane
8 Terry Bob
9 Bob Jane
9 Bob Terry
10 Alex NaN
11 Joe Jane
Finally, to create the adjacency matrix I used crosstab within pandas which counts the occurrences in the two columns specified only.
pd.crosstab(dff['Name'], dff['value'])
value Alex Bob Fred Jane Joe Terry
Name
Alex 0 0 1 0 0 0
Bob 0 0 0 1 1 1
Fred 1 0 0 1 0 0
Jane 0 1 2 0 2 0
Joe 0 0 0 2 0 0
Terry 0 1 0 0 0 0
Upvotes: 3
Reputation: 9619
Here is one approach:
import pandas as pd
import ast
data = ''' 1 2
Jane ['Fred', 'Joe'] ['Joe', 'Fred', 'Bob']
Fred ['Alex'] ['Jane']
Terry NaN ['Bob']
Bob ['Joe'] ['Jane', 'Terry']
Alex ['Fred'] NaN
Joe ['Jane'] ['Jane']'''
df = pd.read_csv(io.StringIO(data), sep='\s\s+', engine='python').fillna('[]').applymap(ast.literal_eval) #if your columns are already lists rather than string representations, use .fillna([]) and skip the applymap
df['all'] = df['1']+df['2'] #merge lists of columns 1 and 2
df_edges = df[['all']].explode('all').reset_index() #create new df by exploding the combined list
df_edges = df_edges.groupby(['index', 'all'])['all'].count().reset_index(name="count") #groupby and count the pairs
df_edges.pivot(index='index', columns='all', values='count').fillna(0) #create adjacency matrix with pivot
Output:
index | Alex | Bob | Fred | Jane | Joe | Terry |
---|---|---|---|---|---|---|
Alex | 0 | 0 | 1 | 0 | 0 | 0 |
Bob | 0 | 0 | 0 | 1 | 1 | 1 |
Fred | 1 | 0 | 0 | 1 | 0 | 0 |
Jane | 0 | 1 | 2 | 0 | 2 | 0 |
Joe | 0 | 0 | 0 | 2 | 0 | 0 |
Terry | 0 | 1 | 0 | 0 | 0 | 0 |
Upvotes: 1