Query optimization for matching nodes with equal values

Question

I want to collect all nodes that have the same property value

MATCH (rhs:DataValue),(lhs:DataValue) WHERE rhs.value = lhs.value RETURN rhs,lhs

I have created an index on the property

CREATE INDEX ON :DataValue(value)

the index is created:

Indexes
    ON :DataValue(value)          ONLINE

I have only 2570 DataValues.

match (n:DataValue) return count(n)
> 2570

However, the query takes ages/does not terminate within the timeout of my browser.

This surprises me as I have an index and expected the query to run within O(n) with n being the amount of nodes.

My train of thought is: If I'd implement it myself I could just match all nodes O(n) sort them by value O(n log n) and then go through the sorted list and return all sublists that are longer than 1 O(n). Thus, the time I could archive is O(n log n). However, I expect the sorting already being covered by the indexing.

How am I mistaken and how can I optimize this query?

InverseFalcon · Accepted Answer

Your complexity is actually O(n^2), since your match creates a cartesian product for rhs and lhs, and then does filtering for every single pairing to see if they are equal. The index doesn't apply in your query at all. You should be able to confirm that by running EXPLAIN or PROFILE on the query.

You'll want to tweak your query a little to get it to O(n). Hopefully in a future neo4j version query planning will be smarter so we don't have to be so explicit.

MATCH (lhs:DataValue)
WITH lhs
MATCH (rhs:DataValue)
WHERE rhs.value = lhs.value 
RETURN rhs,lhs

Note that your returned values will include opposite pairs ((A, B), (B, A)).

Query optimization for matching nodes with equal values

Answers (1)

Related Questions