Why does my cypher query take 10 times longer when I run it with count()?

Question

I start with the following query:

PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN PContains
LIMIT 10

I get "5834 total db hits in 119 ms". The graph correctly shows 9 nodes, and 8 edges connecting them. Then I run an almost-identical query, except that I instead return count(distinct()):

PROFILE
MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(distinct(SPrimePackage))
LIMIT 10

This gives "1382270 total db hits in 1771 ms". The result is correct: 8. However, why is count(distinct()) so much slower and more expensive? Should I be doing this some other way?

I'm running Neo4j 2.3.1

EDIT 1

To ensure I'm comparing apples to apples, and to highlight the question, here is a similar pair of queries and results:

MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN SPrimePackage
LIMIT 10

Note it's returning "SPrimePackage" instead of "PContains" in the original. The result is "5834 total db hits in 740 ms".

Here is that exact same query with "count()":

MATCH Base = (SBase:Snapshot {timestamp:1454983481.304583})-[:contains]->()
MATCH Prime = (:Snapshot {timestamp:1454983521.642284})-[PContains:contains]->(SPrimePackage)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN count(SPrimePackage)
LIMIT 10

The result: "1382270 total db hits in 2731 ms". Note the only difference is the "count()". Intuitively, I would expect "count()" to add a single tallying step, but clearly it's doing much more than that. Why is "count()" triggering all of this extra work?

cybersam · Accepted Answer

[UPDATED]

If you compared the PROFILE output of your 2 (edited) queries, you'd probably see that the only significant difference was the existence of an EagerAggregation operation in the COUNT() version of the query. Aggregation functions use EagerAggregation to collect in memory all the data being aggregated before actually performing the aggregation function (in this case, COUNT()). That requires additional work that is not needed when you do not use the aggregation function.

The following query still uses COUNT() in order to get the count, but greatly reduces the data that has to be aggregated, thus reducing the amount of work that needs to be done in the EagerAggregation step:

PROFILE
MATCH (SBase:Snapshot { timestamp:1454983481.304583 })
USING INDEX SBase:Snapshot(timestamp)
WHERE (SBase)-[:contains]->()
MATCH (s:Snapshot { timestamp:1454983521.642284 })-[:contains]->(SPrimePackage)
USING INDEX s:Snapshot(timestamp)
WHERE NOT (SBase)-[:contains]->(SPrimePackage)
RETURN COUNT(DISTINCT SPrimePackage)
LIMIT 10;

The above query assumes you have already created an index on :Snapshot(timestamp), to greatly speed up the search for the 2 :Snapshot nodes:

CREATE INDEX ON :Snapshot(timestamp);

Using some simple data, the profile I get is:

+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| Operator          | Estimated Rows | Rows | DB Hits | Variables                            | Other                                |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+
| +ProduceResults   |              1 |    1 |       0 | COUNT(DISTINCT SPrimePackage)        | COUNT(DISTINCT SPrimePackage)        |
| |                 +----------------+------+---------+--------------------------------------+--------------------------------------+
| +Limit            |              1 |    1 |       0 | COUNT(DISTINCT SPrimePackage)        | Literal(10)                          |
| |                 +----------------+------+---------+--------------------------------------+--------------------------------------+
| +EagerAggregation |              1 |    1 |       0 | COUNT(DISTINCT SPrimePackage)        |                                      |
| |                 +----------------+------+---------+--------------------------------------+--------------------------------------+
| +AntiSemiApply    |              1 |    7 |       0 | anon[180], s -- SBase, SPrimePackage |                                      |
| |\                +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(Into)   |              1 |    0 |      34 | anon[266] -- SBase, SPrimePackage    | (SBase)-[:contains]->(SPrimePackage) |
| | |               +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument       |              4 |    8 |       0 | SBase, SPrimePackage                 |                                      |
| |                 +----------------+------+---------+--------------------------------------+--------------------------------------+
| +CartesianProduct |              4 |    8 |       0 | SBase -- anon[180], SPrimePackage, s |                                      |
| |\                +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All)    |              4 |    8 |      10 | anon[180], SPrimePackage -- s        | (s)-[:contains]->(SPrimePackage)     |
| | |               +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +NodeIndexSeek  |              2 |    2 |       4 | s                                    | :Snapshot(timestamp)                 |
| |                 +----------------+------+---------+--------------------------------------+--------------------------------------+
| +SemiApply        |              1 |    2 |       0 | SBase                                |                                      |
| |\                +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Expand(All)    |              4 |    0 |       2 | anon[112], anon[126] -- SBase        | (SBase)-[:contains]->()              |
| | |               +----------------+------+---------+--------------------------------------+--------------------------------------+
| | +Argument       |              2 |    2 |       0 | SBase                                |                                      |
| |                 +----------------+------+---------+--------------------------------------+--------------------------------------+
| +NodeIndexSeek    |              2 |    2 |       3 | SBase                                | :Snapshot(timestamp)                 |
+-------------------+----------------+------+---------+--------------------------------------+--------------------------------------+

In addition to using indexing, the above query:

Does not bother to find all nodes contained by SBase, since we need to find just one contained node in order to identify a matching SBase node. The SemiApply operation will complete as soon as a single (SBase)-[:contains]->() match is found, and so the first MATCH clause will result in a single row per SBase instead of N rows. Based on the info in your question, I suspect N would have been about 8.
Has a Cartesian Product that should be pretty fast, since both "legs" of the product should have low cardinality.

Why does my cypher query take 10 times longer when I run it with count()?

Answers (1)

Related Questions