A_dit_rien

Reputation: 297

Count performance with Neo4j using embedded java API

I started testing Neo4j for a program and I am running into some performance issues. As mentioned in the title, Neo4j is embedded directly in the Java code.

My graph contains about 4 million nodes and several hundred million relationships. My test simply sends a query that counts the number of inbound relationships for a node.

This program uses the ExecutionEngine's execute method to send the following query:

start n=node:node_auto_index(id="United States") match s-[:QUOTES]->n return count(s)

By simply adding some print statements I can see how long this query takes, which is usually about 900 ms, which is a lot.
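For context, here is roughly how the query is executed and timed. This is a simplified sketch, not my exact code; the database path is a placeholder:

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class CountInboundQuotes {
    public static void main(String[] args) {
        // Open the embedded database (the path is a placeholder).
        GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("/path/to/graph.db");
        ExecutionEngine engine = new ExecutionEngine(graphDb);

        String query = "start n=node:node_auto_index(id=\"United States\") "
                     + "match s-[:QUOTES]->n return count(s)";

        long start = System.currentTimeMillis();
        ExecutionResult result = engine.execute(query);
        String rows = result.dumpToString(); // forces the lazily evaluated result
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(rows);
        System.out.println("wall-clock time: " + elapsed + " ms");

        graphDb.shutdown();
    }
}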

What surprises me most is that the response includes a "query execution time" that is very different.

For instance a query returned:

+----------+
| count(n) |
+----------+
| 427738   |
+----------+
1 row
1 ms 

According to this response, I understand that Neo4j took 1 ms for the query, but my own log messages show that it actually took 917 ms.

I guess that the 1 ms corresponds to the time required to find the indexed node "United States", which would mean that Neo4j needed about 916 ms for the rest, such as counting the relationships. In that case, how can I get better performance for this query?

Thanks in advance!

Upvotes: 0

Views: 714

Answers (2)

Michael Hunger

Reputation: 41706

Make sure not to measure the first query, because that one only measures how long it takes to load the data from disk into memory.

Make sure to give Neo4j enough memory to cache your data.
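In an embedded setup those two points could look roughly like this. The config keys and sizes below are just examples to adapt to your store files, not values from the answer:

import java.util.Map;

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.helpers.collection.MapUtil;

public class WarmedUpCount {
    public static void main(String[] args) {
        // Example cache settings for a 1.8/1.9-era store; size the mapped-memory
        // values to your store files and raise the JVM heap (-Xmx) so the object
        // cache can hold the hot part of the graph.
        Map<String, String> config = MapUtil.stringMap(
                "cache_type", "strong",
                "neostore.nodestore.db.mapped_memory", "500M",
                "neostore.relationshipstore.db.mapped_memory", "3G");

        GraphDatabaseService graphDb = new GraphDatabaseFactory()
                .newEmbeddedDatabaseBuilder("/path/to/graph.db")
                .setConfig(config)
                .newGraphDatabase();

        ExecutionEngine engine = new ExecutionEngine(graphDb);
        String query = "start n=node:node_auto_index(id=\"United States\") "
                     + "match s-[:QUOTES]->n return count(s)";

        // First run pulls the data from disk into the caches; don't time it.
        engine.execute(query).dumpToString();

        // Later runs measure the warmed-up case.
        long start = System.currentTimeMillis();
        engine.execute(query).dumpToString();
        System.out.println("warmed-up run: " + (System.currentTimeMillis() - start) + " ms");

        graphDb.shutdown();
    }
}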

And try this query to see if it is faster:

start n=node:node_auto_index(id="United States") 
return length(()-[:QUOTES]->n) as cnt

Upvotes: 1

Eve Freeman

Reputation: 33185

Query timers were broken in 1.8.1 and 1.9.M04, when the Cypher laziness issue was fixed (definitely a worthwhile trade for most use cases). But yeah, I think it will be fixed soon.

For now you'll have to time things externally.

Update: As for your question about whether that time is reasonable... It basically needs to scan all ~400k nodes to count them. This is probably reasonable, even if the cache is warmed up and all of those fit into RAM. Having "super nodes" like this is usually not best practice if it can be avoided, although they are going to be making a lot of improvements for this case in future versions (at least, that's what I hear).
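For illustration, the same per-relationship scan happens if you count with the core embedded API instead of Cypher. A minimal sketch, assuming the auto index name and property from the question:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.helpers.collection.IteratorUtil;

public class CoreApiDegreeCount {
    private static final RelationshipType QUOTES =
            DynamicRelationshipType.withName("QUOTES");

    public static void main(String[] args) {
        GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("/path/to/graph.db");

        // Look up the node through the auto index used in the question.
        Node us = graphDb.index().forNodes("node_auto_index")
                .get("id", "United States").getSingle();

        // Counting still has to touch every incoming QUOTES relationship,
        // so the cost grows with the node's degree (~400k here).
        long start = System.currentTimeMillis();
        int count = IteratorUtil.count(us.getRelationships(Direction.INCOMING, QUOTES));
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(count + " incoming QUOTES relationships in " + elapsed + " ms");
        graphDb.shutdown();
    }
}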

Upvotes: 1
