LuamCT

Reputation: 21

Neo4j: Fast query for getting relationships between a set of nodes

I'm looking for a fast Cypher statement that returns all relationships between a known set of nodes (I have their Neo4j IDs), so that I can assemble the subgraph for that particular set of nodes. I'm working within a label called label, which has around 50K nodes and 800K edges between these nodes.

I have several working approaches for this, but none are fast enough for my application, even at small set sizes (less than 1000 nodes).

For example, the following statement does the trick:

MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids}) AND (ID(v) IN {ids}) 
RETURN collect(r)

Here {ids} is a list of numeric Neo4j IDs passed as a parameter to the Py2Neo cypher.execute(statement, parameters) method. The problem is that it takes around 34 seconds for a set of 838 nodes, which returns all 19K relationships between them. I realize the graph is fairly dense, but that's 1.76 seconds for every 1000 edges returned, and I just don't think that's acceptable.

If I use the START clause instead (shown below), the time is actually a little worse.

START u=node({ids}), v=node({ids})
MATCH (u:label)-[r]->(v:label)
RETURN collect(r) 

I've found many similar questions/answers, however they all fall short in some aspect. Is there a better statement for doing this, or even a better graph schema, so that it can scale to sets of thousands of nodes?


UPDATE

Thanks for the fast replies. First, running my current query with 528 input nodes (len(ids) = 528) takes 32.1 seconds; the query plan is below.

NodeByIdSeek: 528 hits
Filter      : 528 hits
Expand(All) : 73,773 hits
Filter      : 73,245 hits
Projection  : 0 hits
Filter      : 0 hits

Brian Underwood's query, with the same input, takes 27.8 seconds. The query plan is identical except for the last two steps (Projection and Filter), which don't exist for his query; the total of db hits is the same, however.

Michael Hunger's query takes 26.9 seconds and the query plan is identical to Brian's query.

I've restarted the server between experiments to avoid cache effects (there's probably a smarter way to do it). I'm also querying straight from the web interface to bypass possible bottlenecks in my code and the libraries I'm using.

Bottom line: Neo4j seems smart enough to optimize my query, but it's still pretty slow even with fairly small sets. Any suggestions?

Upvotes: 2

Views: 1056

Answers (2)

Michael Hunger

Reputation: 41706

How big are your id-lists?

Try this:

MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN count(*)

And once the count looks right, return the relationships themselves:

MATCH (u) WHERE ID(u) IN {ids}
WITH u
MATCH (u)-[r]->(v)
WHERE ID(v) IN {ids}
RETURN r

Also try generating a query plan by prefixing your query with PROFILE; then you can see where the cost is.
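For example, profiling the original query from the question (with the same {ids} parameter) looks like this:

```cypher
PROFILE
MATCH (u:label)-[r]->(v:label)
WHERE ID(u) IN {ids} AND ID(v) IN {ids}
RETURN collect(r)
```

The output shows each operator in the plan together with its db hits, so you can see which step dominates the cost.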

Upvotes: 0

Brian Underwood

Reputation: 10856

I think the problem is that the query is doing a Cartesian product to get all combinations of the 838 nodes, so you end up searching 838 * 838 = 702,244 combinations.

I'm curious how this would perform:

MATCH (u:label)-[r]->(v:label)
WHERE (ID(u) IN {ids})
WITH r, v
WHERE (ID(v) IN {ids})
RETURN collect(r)

Also, why do the collect at the end?
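If the collect isn't needed, streaming the rows back directly avoids building one large aggregate on the server; a sketch of the same query without the aggregation:

```cypher
MATCH (u:label)-[r]->(v:label)
WHERE ID(u) IN {ids}
WITH r, v
WHERE ID(v) IN {ids}
RETURN r
```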

Upvotes: 1
