Reputation: 15
I am new to cypher and have the below query to find mistmaches between 2 source types(for example). I believe syntactically the query looks fine but it takes 1 minute to run on data set of just 1,00,000 nodes. I am not using relations still. Can someone please help in optimizing the query? Thanks.
MATCH (VW_OXSS41:VW_OrderXStatusSummary4{SourceTypeID: "1"})
WHERE apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS41.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH VW_OXSS41.IdentifierValue as X
MATCH (VW_OXSS42:VW_OrderXStatusSummary4{SourceTypeID: "2"})
WHERE apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))>=apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AND apoc.date.parse(VW_OXSS42.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss'))<=apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd'))
WITH apoc.coll.disjunction(COLLECT(X), COLLECT(VW_OXSS42.IdentifierValue)) as XX
UNWIND (XX) as YY
The updated query and the error:-
WITH apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.subtract(X,Y) AS XX
MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
RETURN XX AS MISMATCHES,MAX(z.TimeStamp);
Variable `a` not defined (line 10, column 7 (offset: 551))
"WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b"
Solved the above error like this:-
WITH apoc.date.parse("2020-02-21",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.subtract(X,Y) AS XX
WITH XX, apoc.date.parse("2020-02-20",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-25",'s',('yyyy-MM-dd')) AS b
MATCH (z:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(z.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
AND XX = z.IdentifierValue
RETURN XX AS MISMATCHES,MAX(z.TimeStamp);
With the correct expected output as:-
+---------------------------------------------+
| MISMATCHES | TIMESTAMP |
+---------------------------------------------+
| "W2002201453550218" | "2020-02-21 12:00:16" |
| "W2002201453550222" | "2020-02-21 12:00:16" |
| "W2002201453550223" | "2020-02-21 09:30:36" |
| "W2002201453550224" | "2020-02-21 12:00:16" |
| "W2002201453550226" | "2020-02-21 12:00:16" |
| "W2002201453550227" | "2020-02-21 12:00:16" |
| "W2002201453550237" | "2020-02-21 12:00:16" |
| "3011WOS002978598" | "2020-02-21 10:00:54" |
| "3011WOS002978595" | "2020-02-21 13:00:57" |
| "0010000000006183" | "2020-02-21 16:00:41" |
| "W2002181111547439" | "2020-02-21 04:00:34" |
| "11" | "2020-02-21 16:00:41" |
| "10112787861P1458" | "2020-02-21 10:00:54" |
+---------------------------------------------+
Wondering if there's a better approach?
Upvotes: 0
Views: 131
Reputation: 66967
You need to avoid making a cartesian product between the results of your two MATCH
clauses. Let's say the two MATCH
clauses would normally return N
and M
nodes, respectively, when executed in their own queries. Because your query combines those two MATCH
clauses in the way that it does, your second MATCH
clause is actually performing N*M
matches (and producing N*M
result rows).
You need to make sure you have created an index on :VW_OrderXStatusSummary4(SourceTypeID)
. That will optimize the lookups performed by the MATCH
clauses.
You can simplify your Cypher code to avoid duplicated function calls.
After creating the index indicated above, try this:
WITH apoc.date.parse("2020-02-10",'s',('yyyy-MM-dd')) AS a, apoc.date.parse("2020-02-16",'s',('yyyy-MM-dd')) AS b
MATCH (x:VW_OrderXStatusSummary4 {SourceTypeID: "1"})
WHERE a <= apoc.date.parse(x.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH a, b, COLLECT(x.IdentifierValue) AS X
MATCH (y:VW_OrderXStatusSummary4 {SourceTypeID: "2"})
WHERE a <= apoc.date.parse(y.TimeStamp,'s',('yyyy-MM-dd HH:mm:ss')) <= b
WITH X, COLLECT(y.IdentifierValue) AS Y
UNWIND apoc.coll.disjunction(X, Y) AS YY
...
Performing the COLLECT(x.IdentifierValue)
operation in the first WITH
clause causes it to return all the x
nodes in a single result row (instead of N
result rows). This allows the second MATCH
to avoid a cartesian product issue.
Upvotes: 1