Reputation: 631
I am receiving data in the following format:
tail head
P01106 Q09472
P01106 Q13309
P62136 Q13616
P11831 P18146
P13569 P20823
P20823 P01100
...
Is there a good way to format this data as a graph with a numpy array? I am hoping to compute PageRank using this graph.
So far I have
import numpy as np
data = np.genfromtxt('wnt_edges.txt', skip_header=1, dtype=str)
I was thinking about using the graph data structure from Representing graphs (data structure) in Python but it didn't seem to make sense in this case since I'll be doing matrix multiplication.
Upvotes: 5
Views: 7390
Reputation: 23637
To avoid reinventing the wheel you should use networkx as suggested in comments and other answers.
If, for educational purposes, you want to reinvent the wheel you can create an adjacency matrix. The PageRank can be computed from that matrix:
The PageRank values are the entries of the dominant right eigenvector of the modified adjacency matrix.
Since each row/column of the adjacency matrix represents a node, you will need to enumerate the nodes so each node is represented by a unique number starting from 0.
import numpy as np
data = np.array([['P01106', 'Q09472'],
['P01106', 'Q13309'],
['P62136', 'Q13616'],
['P11831', 'P18146'],
['P13569', 'P20823'],
['P20823', 'P01100']])
nodes = np.unique(data) # mapping node name --> index
noidx = {n: i for i, n in enumerate(nodes)} # mapping node index --> name
n = nodes.size # number of nodes
numdata = np.vectorize(noidx.get)(data) # replace node id by node index
A = np.zeros((n, n))
for tail, head in numdata:
A[tail, head] = 1
#A[head, tail] = 1 # add this line for undirected graph
This results in the following graph representation A
:
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
A 1 in row 5, column 0 for example means that there is an edge from node 5 to node 0, which corresponds to 'P20823'
--> 'P01100'
. Use the nodes
array to look up node names from indices:
print(nodes)
['P01100' 'P01106' 'P11831' 'P13569' 'P18146' 'P20823' 'P62136' 'Q09472'
'Q13309' 'Q13616']
If there are many nodes and few connections it's better to use a sparse matrix
for A
. But first try to stay with a dense matrix and only bother to switch to sparse of you have memory or performance issues.
Upvotes: 9
Reputation: 3238
I strongly suggest networkx
:
import networkx as nx
#make the graph
G = nx.Graph([e for e in data])
#compute the pagerank
nx.pagerank(G)
# output:
# {'P01100': 0.0770275315329843, 'P01106': 0.14594493693403143,
# 'P11831': 0.1, 'P13569': 0.0770275315329843, 'P18146': 0.1,
# 'P20823': 0.1459449369340315, 'P62136': 0.1, 'Q09472':
# 0.07702753153298428, 'Q13309': 0.07702753153298428, 'Q13616': 0.1}
That's all it takes. pagerank documentation is here.
Upvotes: 3