Simon
Simon

Reputation: 631

Representing Graph with Numpy Array

I am receiving data in the following format:

tail head
P01106  Q09472
P01106  Q13309
P62136  Q13616
P11831  P18146
P13569  P20823
P20823  P01100
...

Is there a good way to format this data as a graph with a numpy array? I am hoping to compute PageRank using this graph.

So far I have

import numpy as np
data = np.genfromtxt('wnt_edges.txt', skip_header=1, dtype=str)

I was thinking about using the graph data structure from Representing graphs (data structure) in Python but it didn't seem to make sense in this case since I'll be doing matrix multiplication.

Upvotes: 5

Views: 7390

Answers (2)

MB-F
MB-F

Reputation: 23637

To avoid reinventing the wheel you should use networkx as suggested in comments and other answers.

If, for educational purposes, you want to reinvent the wheel you can create an adjacency matrix. The PageRank can be computed from that matrix:

The PageRank values are the entries of the dominant right eigenvector of the modified adjacency matrix.

Since each row/column of the adjacency matrix represents a node, you will need to enumerate the nodes so each node is represented by a unique number starting from 0.

import numpy as np

data = np.array([['P01106', 'Q09472'],
                 ['P01106', 'Q13309'],
                 ['P62136', 'Q13616'],
                 ['P11831', 'P18146'],
                 ['P13569', 'P20823'],
                 ['P20823', 'P01100']])


nodes = np.unique(data)  # mapping node name --> index
noidx = {n: i for i, n in enumerate(nodes)}  # mapping node index --> name

n = nodes.size  # number of nodes

numdata = np.vectorize(noidx.get)(data)  # replace node id by node index

A = np.zeros((n, n))
for tail, head in numdata:
    A[tail, head] = 1
    #A[head, tail] = 1  # add this line for undirected graph

This results in the following graph representation A:

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

A 1 in row 5, column 0 for example means that there is an edge from node 5 to node 0, which corresponds to 'P20823' --> 'P01100'. Use the nodes array to look up node names from indices:

print(nodes)
['P01100' 'P01106' 'P11831' 'P13569' 'P18146' 'P20823' 'P62136' 'Q09472'
 'Q13309' 'Q13616']

If there are many nodes and few connections it's better to use a sparse matrix for A. But first try to stay with a dense matrix and only bother to switch to sparse of you have memory or performance issues.

Upvotes: 9

Mahdi
Mahdi

Reputation: 3238

I strongly suggest networkx:

import networkx as nx
#make the graph 
G = nx.Graph([e for e in data])
#compute the pagerank 
nx.pagerank(G)
# output:
# {'P01100': 0.0770275315329843,  'P01106': 0.14594493693403143, 
# 'P11831': 0.1,  'P13569': 0.0770275315329843,  'P18146': 0.1, 
# 'P20823': 0.1459449369340315,  'P62136': 0.1,  'Q09472':
# 0.07702753153298428,  'Q13309': 0.07702753153298428,  'Q13616': 0.1}

That's all it takes. pagerank documentation is here.

Upvotes: 3

Related Questions