talhaocakci
talhaocakci

Reputation: 141

Neo4j Query Optimization for Cartesian Product

I am trying to implement a user-journey analytics solution. Simply analyze on which screens, which users leave the application. For this , I have modeled the data like this: blue circles are events (a single activity that contains user and device data). green circles are event types (activity types such as the name of the screen) I modeled single activity since I want to index some attributes. Relation attributes can not be indexed in Neo4j.

With this model, I am trying to write a query that follows three successive event types with below query:

MATCH (eventType1:EventType {eventName:'viewStart-home'})<--(event:EventNode)
<--(eventType2:EventType{eventName:'viewStart-payment'})  

WITH distinct event.deviceId as eUsers, event.clientCreationDate as eDate

MATCH((eventType2)<--(event2:EventNode)
<--(eventType3:EventType{eventName:'viewStart-screen1'}))

WITH distinct event2.deviceId as e2Users, event2.clientCreationDate as e2Date
RETURN e2Users limit 200000

And the execution plan is below:

last NodeByLabelScan ruins everything

I could not figure the reason of this process out. Can you help me?

Upvotes: 0

Views: 64

Answers (1)

cybersam
cybersam

Reputation: 66967

Your query is doing a lot more work than it needs to.

The first WITH clause is not needed at all, since its generated eUsers and eDate variables are never used. And the second WITH clause does not need to generate the unused e2Date variable.

In addition, you could first add an index for :EventType(eventName) to speed up the processing:

CREATE INDEX ON :EventType(eventName);

With these changes, your query's profile could be simpler and the processing would be faster.

Here is an updated query (that should use the index to quickly find the EventType node at one end of the path, to kick off the query):

MATCH (:EventType {eventName:'viewStart-home'})<--(:EventNode)
  <--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
  <--(:EventType{eventName:'viewStart-screen1'})
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;

Here is an alternate query that uses 2 USING INDEX hints to tell the planner to quickly find the :EventType nodes at both ends of the path to kick off the query. This might be even faster than the first query:

MATCH (a:EventType {eventName:'viewStart-home'})<--(:EventNode)
  <--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
  <--(b:EventType{eventName:'viewStart-screen1'})
USING INDEX a:EventType(eventName)
USING INDEX b:EventType(eventName)
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;

Try profiling them both on your DB, and pick the best one or keep tweaking further.

Upvotes: 1

Related Questions