Chris Tembreull

Reputation: 13

Extremely slow weighted-edge query in Neo4j

I have a graph with about 1.2 million nodes and roughly 3.85 million relationships between them. I need to find the top x weighted edges, that is, the top x tuples (a, b, n) where (a, b) is a unique pair of vertices and n is the number of connections between them. I am currently getting this data via the following Cypher query:

MATCH (n)-[r]->(x)
WITH n, x, count(r) AS weight
ORDER BY weight DESC
LIMIT 50
RETURN n.screen_name, x.screen_name, weight;

This query takes about 25 seconds to run, which is far too slow for my needs. I ran the query with the profiler, which returned the following:

ColumnFilter(0)
  |
  +Extract
    |
    +ColumnFilter(1)
      |
      +Top
        |
        +EagerAggregation
          |
          +TraversalMatcher

+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
|         Operator |    Rows |  DbHits | Identifiers |                                                                                          Other |
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+
|  ColumnFilter(0) |      50 |       0 |             |                                                      keep columns n.twitter, x.twitter, weight |
|          Extract |      50 |     200 |             |                                                                           n.twitter, x.twitter |
|  ColumnFilter(1) |      50 |       0 |             |                                                                      keep columns n, x, weight |
|              Top |      50 |       0 |             | {  AUTOINT0}; Cached(  INTERNAL_AGGREGATE01a74d75-74df-42f8-adc9-9a58163257d4 of type Integer) |
| EagerAggregation | 3292734 |       0 |             |                                                                                           n, x |
| TraversalMatcher | 3843717 | 6245164 |             |                                                                                        x, r, x |
+------------------+---------+---------+-------------+------------------------------------------------------------------------------------------------+

My questions, then, are these:

1. Are my expectations way off, and is this the sort of thing that's just going to be slow? It's functionally a map/reduce problem, but this isn't that large a data set - and it's just test data. The real data will be much larger (but will be filterable by relationship properties; I'm working with a representative sample).

2. What the heck can I do to make this run faster? I've tried using a START clause, but that doesn't seem to help. In fact, it seems to make things worse.

3. What don't I know here, and where can I go to find out what I don't know?
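To make question 1 concrete: the computation itself is just count-and-take-top-x. Here it is on toy data in plain Python (nothing Neo4j-specific; the edge list is made up):

```python
from collections import Counter
import heapq

# Toy edge list standing in for the graph's relationships; a repeated
# (source, target) pair represents parallel edges between the same nodes.
edges = [("a", "b"), ("a", "b"), ("b", "c"), ("a", "c"), ("a", "b")]

# "Map/reduce" step: count parallel edges per ordered vertex pair.
weights = Counter(edges)

# Top-x step (x = 2 here): heapq.nlargest avoids sorting every pair.
top = heapq.nlargest(2, weights.items(), key=lambda kv: kv[1])
# top[0] == (("a", "b"), 3)
```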

Thanks,

Chris

Upvotes: 1

Views: 152

Answers (1)

Kenny Bastani

Reputation: 3308

You're hitting the database 6,245,164 times in the first statement: MATCH (n)-[r]->(x).

It looks like what you're attempting is a graph global query -- that is, a query that runs over the entire graph. A query like this can't take advantage of indexing to reduce the number of hits to the database.
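For contrast, an anchored query starts from one known node and only touches that node's relationships. A sketch, assuming a legacy index named users keyed on screen_name (both names hypothetical):

```cypher
// Anchored variant: Cypher uses the index to find the start node,
// then traverses only its outgoing relationships.
START n=node:users(screen_name = 'some_user')
MATCH (n)-[r]->(x)
RETURN n.screen_name, x.screen_name, count(r) AS weight
ORDER BY weight DESC
LIMIT 50;
```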

To get the level of performance you need, an unmanaged extension may be required.

http://docs.neo4j.org/chunked/stable/server-unmanaged-extensions.html

Also, a good community resource to learn about unmanaged extensions: http://www.maxdemarzi.com/

The approach here would be to create a REST API method that extends the Neo4j server. This requires some experience programming with Java.

Alternatively, you may be able to run an iterative Cypher query that maintains an aggregate count of the [r] relationships between each (n) and (x).
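For example, you could precompute the pair counts once, then serve the top-50 read from the precomputed data. A sketch, where the :WEIGHTED relationship type and weight property are hypothetical names:

```cypher
// One-off precomputation: collapse parallel edges into a single
// relationship carrying the pair count.
MATCH (n)-[r]->(x)
WITH n, x, count(r) AS weight
CREATE (n)-[:WEIGHTED { weight: weight }]->(x);

// The top-50 read then orders by a stored property instead of
// re-aggregating ~3.8 million relationships on every query.
MATCH (n)-[w:WEIGHTED]->(x)
RETURN n.screen_name, x.screen_name, w.weight AS weight
ORDER BY weight DESC
LIMIT 50;
```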

Upvotes: 1
