Cypher: Personalized PageRank with multiple source nodes returning duplicates

Question

I have a Cypher query that runs personalized PageRank over a set of source nodes. I want to return the top n scoring nodes, including their PageRank scores and additional properties, along with all relationships between these nodes.

With the help of SO I've gotten to this point:

    MATCH (p) WHERE p.paper_id IN $paper_ids
    CALL algo.pageRank.stream(null, null, {direction: "BOTH", sourceNodes: [p]})
    YIELD nodeId, score
    WITH p, nodeId, score ORDER BY score DESC
    LIMIT 25

    MATCH (n) WHERE id(n) = nodeId
    WITH collect(nodeId) as ids, collect(n {.*, score}) as nodes
    CALL apoc.algo.cover(ids) YIELD rel
    RETURN ids, nodes, collect(rel) as rels

The only problem I'm having is that this query returns duplicate nodes. For example, a node can be returned several times with different PageRank scores. I suspect this is due to having multiple source nodes, so the different scores correspond to the PageRank scores for each source node. There are no duplicates when PageRank is run for a single source node.

This is an issue because I want to return n unique nodes (n = 25 in the code block above). In a typical run with two source nodes I get only about 21-22 unique nodes.

How can I ensure that I return n unique nodes?

cybersam · Accepted Answer

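As @Tezra says, you need to determine how you want to resolve multiple scores for the same node. Each source node p produces its own score for a given nodeId, so the same node can appear once per source.

The following options all involve making a simple change to this clause:

    WITH p, nodeId, score

In each case p is dropped from the clause. Cypher groups by every non-aggregated expression in a WITH, so keeping p would group by (p, nodeId) and leave the duplicates in place.

Options:

Use the maximum score:
```
WITH nodeId, MAX(score) AS score
```
Use the minimum score:
```
WITH nodeId, MIN(score) AS score
```
Use the average score:
```
WITH nodeId, AVG(score) AS score
```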
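
For reference, here is a sketch of how the full query from the question might look with the maximum-score option applied. It assumes the same procedures and the `$paper_ids` parameter used in the question. The only changes are that `p` is left out of the aggregating WITH, so that rows are grouped by `nodeId` alone, and that the ORDER BY and LIMIT are split onto separate lines. I have not run it against your data.

```cypher
MATCH (p) WHERE p.paper_id IN $paper_ids
CALL algo.pageRank.stream(null, null, {direction: "BOTH", sourceNodes: [p]})
YIELD nodeId, score
// Group by nodeId only: one row per node, keeping its highest score
// across all source nodes. Including p here would group by (p, nodeId).
WITH nodeId, MAX(score) AS score
ORDER BY score DESC
LIMIT 25

MATCH (n) WHERE id(n) = nodeId
WITH collect(nodeId) AS ids, collect(n {.*, score}) AS nodes
CALL apoc.algo.cover(ids) YIELD rel
RETURN ids, nodes, collect(rel) AS rels
```

Because the aggregation runs before the LIMIT, the 25 rows that survive are 25 distinct nodes, provided the graph has at least that many scored nodes. Swap MAX for MIN or AVG to use one of the other options.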
