Reputation: 2641
I am using Ubuntu 12.04. Say we have a bunch of nodes and edges loaded in Neo4j how to run algorithm like PageRank? Any packages?
Upvotes: 2
Views: 5181
Reputation: 3251
The other answers of 2014 are not up to date anymore.
There is an implemented PageRank algorithm.
CALL algo.pageRank.stream('Site', 'links', {iterations:20, dampingFactor:0.85})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS page,score
ORDER BY score DESC
'Site' has to be the Label of the sites/pages and 'links' has to be the relationship-type between the sites (which site links with other site).
Upvotes: 2
Reputation: 10526
You can implement the iterative algorithm in cypher, but you have to run it a few times to get proper results:
/* we have to go through all nodes */
match (node)
with
collect(distinct node) as pages
unwind pages as dest
/* let's find all source citations for a given node */
match (source)-[:LINK]->(dest)
with
collect(distinct source) as sources,
dest as dest
unwind sources as src
/* we have to know how many relationships the source node has */
match (src)-[r:LINK]->()
with
src.pageRank / count(r) as points,
dest as dest
/* now we have all information to update the destination node with the new pagerank */
with
sum(points) as p,
dest as dest
set dest.pageRank = 0.15 + 0.85 * p;
To know when to stop, after each iteration you can check the top few values, and if they stop changing massively you can stop the iteration:
MATCH (source) RETURN source.pageRank ORDER BY source.pageRank DESC LIMIT 25;
For me in an interconnected graph of around 10.000 nodes I had to run it 50 times for the iteration to converge. Each run took around half a minute to complete, so it's quite slow, and you might want to check a plugin solution in case you need better performance.
Upvotes: 0
Reputation: 41
If you're looking for just a quick way to approximate PageRank over your graph, I've implemented something that approximately works and you could tweak it for your specific data set:
match (a)
set a.pagerank = 0.0
with collect(distinct a) as nodes,count(a) as num_nodes
/* we need the total number of nodes to modulate our values */
unwind nodes as a
/* now use all then nodes as separate rows for the next calculation */
match (a)-[r]-(b)
/* count the total relationships a node has and figure out what slice
the nodes connected to this one get.*/
with a,collect(r) as rels, count(r) as num_rels,1.0/num_nodes as rank
unwind rels as rel
/* go through all the relationsips and add the rank proportion to the
destination node */
set endnode(rel).pagerank =
case
when num_rels > 0 and id(startnode(rel)) = id(a) then
endnode(rel).pagerank + rank/(num_rels)
else endnode(rel).pagerank
end
,startnode(rel).pagerank =
case
when num_rels > 0 and id(endnode(rel)) = id(a) then
startnode(rel).pagerank + rank/(num_rels)
else startnode(rel).pagerank
end
with collect(distinct a) as a,rank
return a
/* now we're done updating the pageranks. woohoo! */
Upvotes: 3
Reputation: 368
A fancy approach is to rely on a fast processing platform at web scale, on which massive tasks can be divided in parallel subtasks. The most effective way for fullfilling this vision is using a graph parallel processing engine, as for instance Spark GraphX.
Mazerunner by Kenny Bastani is a brillant tool enhancing Neo4J with parallel graph processing. Subgraph cam thus be exported from Neo4j to Spark, then processed distributelly and then re-imported in the original GraphDB
Give a look at:
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html?m=1
Upvotes: 2
Reputation: 2661
Have a look at the GraphAware Framework, namely timer-driven runtime modules.
Essentially, the framework allows you to write global graph algorithms (like PageRank) and let them be computed continuously on your graph database. The computation slows down when your database is busy with normal transactional processing and speeds up during quiet periods.
We're working on our own PageRank Module; it is still work in progress, but might be helpful in getting you started.
DISCLAIMER: I'm one of the framework authors.
Upvotes: 3
Reputation: 808
I am looking around the web for the same reason.
As far I understood there is some ways to implemented it:
1) Write your own Neo4J plugin in Java:
By default you can get some algorithms (http://docs.neo4j.org/chunked/stable/rest-api-graph-algos.html), but any advanced option it is probably best to write it. Here the reference: http://docs.neo4j.org/chunked/stable/server-plugins.html
2) Use the Gremlin plugin together to Blueprints/Furnace.
This is exactly what I am trying now (more exactly the community detection algorithm). Give a try to https://github.com/tinkerpop/gremlin/wiki and https://github.com/tinkerpop/furnace . It is supposed to work through the Gremlin plugin in your Neo4j server or through some API. I'm a python fan, so I am trying bulbs or pyblueprints. I can't tell you what's is better right now.
3) Query your graph and load it to a known framework
There is thousands of ways to implement it using Python, C, R ... By instance, I would recommend using networkx (python) or igraph (R). As an example: https://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
Hope it helped.
Upvotes: 4