user3692521

Reputation: 2641

How to run PageRank in Neo4j?

I am using Ubuntu 12.04. Say we have a bunch of nodes and edges loaded into Neo4j; how do I run an algorithm like PageRank over them? Are there any packages for this?

Upvotes: 2

Views: 5181

Answers (6)

hardfork

Reputation: 3251

The other answers from 2014 are no longer up to date.

There is now a ready-made PageRank procedure (from the Neo4j Graph Algorithms plugin):

CALL algo.pageRank.stream('Site', 'links', {iterations:20, dampingFactor:0.85})
YIELD nodeId, score
RETURN algo.getNodeById(nodeId).name AS page, score
ORDER BY score DESC

'Site' has to be the label of the sites/pages, and 'links' has to be the relationship type between the sites (i.e., which site links to which other site).
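
If you want to write the scores back onto the nodes instead of streaming them, the same plugin also offers a write variant; a minimal sketch, assuming 'pagerank' as the property name (exact config options may vary slightly between plugin versions):

CALL algo.pageRank('Site', 'links',
  {iterations:20, dampingFactor:0.85, write:true, writeProperty:'pagerank'})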

Upvotes: 2

SztupY

Reputation: 10526

You can implement the iterative algorithm in Cypher, but you have to run it a few times to get proper results:

/* we have to go through all nodes */
match (node)
with
  collect(distinct node) as pages
unwind pages as dest
  /* let's find all source citations for a given node */
  match (source)-[:LINK]->(dest)
  with
    collect(distinct source) as sources,
    dest as dest
    unwind sources as src
      /* we have to know how many relationships the source node has */
      match (src)-[r:LINK]->()
      with
        src.pageRank / count(r) as points,
        dest as dest
      /* now we have all information to update the destination node with the new pagerank */
      with
        sum(points) as p,
        dest as dest
      set dest.pageRank = 0.15 + 0.85 * p;
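
Note that the query assumes every node already carries a pageRank property; a minimal initialization pass before the first iteration, assuming a starting value of 1.0 (1/N is another common choice), could be:

/* give every node a starting pageRank before the first iteration */
match (node)
set node.pageRank = 1.0;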

To know when to stop, check the top few values after each iteration; once they stop changing significantly, you can stop iterating:

MATCH (source) RETURN source.pageRank ORDER BY source.pageRank DESC LIMIT 25;

For me, in an interconnected graph of around 10,000 nodes, I had to run it 50 times for the iteration to converge. Each run took around half a minute to complete, so it's quite slow, and you might want to look at a plugin solution if you need better performance.

Upvotes: 0

Garret Meier

Reputation: 41

If you're looking for a quick way to approximate PageRank over your graph, I've implemented something that roughly works, and you could tweak it for your specific data set:

match (a)
set a.pagerank = 0.0
with collect(distinct a) as nodes,count(a) as num_nodes
    /* we need the total number of nodes to modulate our values */
unwind nodes as a
    /* now use all the nodes as separate rows for the next calculation */
match (a)-[r]-(b)
    /* count the total relationships a node has and figure out what slice
    the nodes connected to this one get.*/
    with a,collect(r) as rels, count(r) as num_rels,1.0/num_nodes as rank
        unwind rels as rel
        /* go through all the relationships and add the rank proportion to the
        destination node */
        set endnode(rel).pagerank =
            case
                when num_rels > 0 and id(startnode(rel)) = id(a) then
                     endnode(rel).pagerank + rank/(num_rels)
                else endnode(rel).pagerank
            end
        ,startnode(rel).pagerank =
            case
                when num_rels > 0 and id(endnode(rel)) = id(a) then
                    startnode(rel).pagerank + rank/(num_rels)
                else startnode(rel).pagerank
            end
with collect(distinct a) as a,rank
return a
/* now we're done updating the pageranks. woohoo! */
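
Since this is a single distribution pass rather than a full iterative PageRank, it can help to sanity-check the top-ranked nodes afterwards; a quick query for that (using the pagerank property set above) might be:

match (n)
return id(n) as node, n.pagerank as pagerank
order by pagerank desc limit 10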

Upvotes: 3

m.piunti

Reputation: 368

A fancier approach is to rely on a fast, web-scale processing platform on which massive tasks can be divided into parallel subtasks. The most effective way to do this is to use a parallel graph processing engine such as Spark GraphX.

Mazerunner by Kenny Bastani is a brilliant tool that enhances Neo4j with parallel graph processing. Subgraphs can be exported from Neo4j to Spark, processed in a distributed fashion, and then re-imported into the original graph database.

Have a look at:

http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html?m=1

Upvotes: 2

Michal Bachman

Reputation: 2661

Have a look at the GraphAware Framework, specifically its timer-driven runtime modules.

Essentially, the framework allows you to write global graph algorithms (like PageRank) and let them be computed continuously on your graph database. The computation slows down when your database is busy with normal transactional processing and speeds up during quiet periods.

We're working on our own PageRank module; it is still a work in progress, but it might help you get started.

DISCLAIMER: I'm one of the framework authors.

Upvotes: 3

Fernando Ferreira

Reputation: 808

I am looking around the web for the same reason.

As far as I understand, there are a few ways to implement it:

1) Write your own Neo4j plugin in Java:

By default you get some algorithms (http://docs.neo4j.org/chunked/stable/rest-api-graph-algos.html), but for anything more advanced it is probably best to write your own. Here is the reference: http://docs.neo4j.org/chunked/stable/server-plugins.html

2) Use the Gremlin plugin together with Blueprints/Furnace.

This is exactly what I am trying now (more precisely, a community detection algorithm). Give https://github.com/tinkerpop/gremlin/wiki and https://github.com/tinkerpop/furnace a try. It is supposed to work through the Gremlin plugin in your Neo4j server or through some API. I'm a Python fan, so I am trying bulbs or pyblueprints; I can't tell you which is better right now.

3) Query your graph and load it into a known framework

There are thousands of ways to implement it using Python, C, R, etc. For instance, I would recommend networkx (Python) or igraph (R). As an example: https://networkx.github.io/documentation/latest/reference/generated/networkx.algorithms.link_analysis.pagerank_alg.pagerank.html
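
For example, a minimal Cypher query (assuming a LINK relationship type; adjust to your schema) that pulls out an edge list you can dump to CSV and feed into networkx.pagerank or igraph:

MATCH (a)-[:LINK]->(b)
RETURN id(a) AS source, id(b) AS target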

Hope it helped.

Upvotes: 4
