Lior Goldemberg

Reputation: 876

Neo4j query has low performance when looking for a path

I have the following Cypher query, which runs pretty fast (3 seconds):

  MATCH (step1:Hit),(step2:Hit),(step3:Hit),(step4:Hit),
step1-[:VISITED]->(_page1:Page), step1-[:SOURCE_COUNTRY]->(_country1:Country), step1-[:USED_DEVICE]->(_device1:Device),
step2-[:VISITED]->(_page2:Page), step2-[:SOURCE_COUNTRY]->(_country2:Country), step2-[:USED_DEVICE]->(_device2:Device)  ,
step3-[:VISITED]->(_page3:Page), step3-[:SOURCE_COUNTRY]->(_country3:Country), step3-[:USED_DEVICE]->(_device3:Device),
step4-[:VISITED]->(_page4:Page), step4-[:SOURCE_COUNTRY]->(_country4:Country), step4-[:USED_DEVICE]->(_device4:Device) 

WHERE _page1.page_key =~ '(?i)(.*lnd\\.my-domain.com.*)' and step1.date_time>=1432296000 and NOT(()-[:NEXT*]->step1)
and _page2.page_key =~ '(?i)(.*register.*)' and step2.date_time>=1432296000 
and  _page3.page_key =~ '(?i)(.*customer-info.*)' and step3.date_time>=1432296000 
and _page4.page_key =~ '(?i)(.*deposit.*)' and step4.date_time>=1432296000
return step1 limit 5

Once I added a path variable (the last 2 lines), it runs for 5 minutes :(

MATCH (step1:Hit),(step2:Hit),(step3:Hit),(step4:Hit),
step1-[:VISITED]->(_page1:Page), step1-[:SOURCE_COUNTRY]->(_country1:Country), step1-[:USED_DEVICE]->(_device1:Device),
step2-[:VISITED]->(_page2:Page), step2-[:SOURCE_COUNTRY]->(_country2:Country), step2-[:USED_DEVICE]->(_device2:Device),
step3-[:VISITED]->(_page3:Page), step3-[:SOURCE_COUNTRY]->(_country1:Country), step3-[:USED_DEVICE]->(_device3:Device),
step4-[:VISITED]->(_page4:Page), step4-[:SOURCE_COUNTRY]->(_country4:Country), step4-[:USED_DEVICE]->(_device4:Device) 

WHERE _page1.page_key =~ '(?i)(.*lnd\\.my-domain.com.*)' and step1.date_time>=1432296000 and NOT(()-[:NEXT*]->step1)
and _page2.page_key =~ '(?i)(.*register.*)' and step2.date_time>=1432296000 
and  _page3.page_key =~ '(?i)(.*customer-info.*)' and step3.date_time>=1432296000 
and _page4.page_key =~ '(?i)(.*deposit.*)' and step4.date_time>=1432296000

MATCH path=step1-[:NEXT*..2]->step2-[:NEXT*..2]->step3-[:NEXT*..2]->step4
return path limit 5

The (pseudo) structure of the graph is:

(User) has multiple (Session {session_id})

(Session) has 1 or more (Hit {date_time,hit_id})

(Hit) has a single relationship to all of the following:

(Browser),(Country),(Device),(Page {page_key})

Also, each Hit has 0 or more [NEXT] relationships to the next Hit in the same session.

Upvotes: 0

Views: 44

Answers (1)

Michael Hunger

Reputation: 41676

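The problem you run into is not the path search but the way you find your start and end nodes.

Another issue is that you create a huge cardinality explosion at the beginning of your query: the four Hit patterns are matched independently of each other, so every combination of candidate hits has to be considered.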

Try running your query with a "PROFILE " prefix in the Neo4j browser; you will see that a lot of time is spent scanning all your data again and again.

You are also not using Country and Device at all. What do you need them for?
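
As a minimal sketch of what that looks like, the query below profiles just one of the four steps on its own. It reuses the labels, relationship type and regular expression from the question; cutting the query down to a single step is only an assumption about a convenient starting point, not something taken from the original query.

```cypher
// Prefix the statement with PROFILE to get the executed plan,
// including rows and db hits per operator.
PROFILE
MATCH (step2:Hit)-[:VISITED]->(page2:Page)
WHERE page2.page_key =~ '(?i)(.*register.*)'
  AND step2.date_time >= 1432296000
RETURN step2
LIMIT 5
```

Operators that scan every :Hit or :Page node and report very high db hits are the ones worth removing first.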

For now I recommend using exact lookups for the page search, and looking into full-text search only once that query is really fast. An exact lookup means creating an INDEX ON :Page(page_key) and filtering with page.page_key = {url} or page.page_key IN {urls}.

For efficient full-text search, right now you'll have to use a manual index.

But soon, with Neo4j 2.3, you will be able to use LIKE, which will be index-supported.
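
A rough sketch of that setup in Neo4j 2.x syntax follows. The parameter names {url} and {urls} come from the text above; the values behind them are whatever exact page keys your application knows about, which is an assumption since the question only shows regular expressions. Run each statement separately.

```cypher
// One-time setup: schema index on the property used for exact lookups.
CREATE INDEX ON :Page(page_key);

// Exact lookup of a single page; {url} is a query parameter.
MATCH (page:Page)
WHERE page.page_key = {url}
RETURN page;

// Lookup of several known pages, then only the hits that visited them;
// {urls} is a list parameter.
MATCH (page:Page)
WHERE page.page_key IN {urls}
MATCH (hit:Hit)-[:VISITED]->(page)
WHERE hit.date_time >= 1432296000
RETURN hit
LIMIT 5;
```

Starting from the few matching Page nodes and expanding to their hits avoids touching every Hit in the store.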

The same goes for your timestamp ranges; there you could alternatively use a time tree.

Depending on which of the two selection criteria is more selective, I'd look into either of those.

All your paths are also disconnected; you should try to connect them one step after another. Right now they are not related to each other at all.

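This NOT(()-[:NEXT*]->step1) should be NOT(()-[:NEXT]->step1). A single incoming NEXT relationship is enough to tell that step1 is not the first hit, so there is no need for a variable-length search.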
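
Putting the last two points together, a connected version of the query could look roughly like the sketch below. It is not tested against your data. The parameters {landing_url}, {register_url}, {customer_info_url} and {deposit_url} are hypothetical names for exact page keys, and the sketch assumes NEXT always points forward in time, so only step1 needs the date_time filter.

```cypher
// Step 1: entry hits on the landing page (no incoming NEXT).
MATCH (page1:Page)<-[:VISITED]-(step1:Hit)
WHERE page1.page_key = {landing_url}
  AND step1.date_time >= 1432296000
  AND NOT (()-[:NEXT]->(step1))
// Step 2: expand from those hits only, at most two NEXT hops.
MATCH (step1)-[:NEXT*..2]->(step2:Hit)-[:VISITED]->(page2:Page)
WHERE page2.page_key = {register_url}
// Step 3: continue from the hits found in step 2.
MATCH (step2)-[:NEXT*..2]->(step3:Hit)-[:VISITED]->(page3:Page)
WHERE page3.page_key = {customer_info_url}
// Step 4: continue from the hits found in step 3.
MATCH (step3)-[:NEXT*..2]->(step4:Hit)-[:VISITED]->(page4:Page)
WHERE page4.page_key = {deposit_url}
// All four hits are bound now, so building the path is cheap.
MATCH path = (step1)-[:NEXT*..2]->(step2)-[:NEXT*..2]->(step3)-[:NEXT*..2]->(step4)
RETURN path
LIMIT 5
```

Each MATCH starts from nodes that are already bound, so the candidate set shrinks at every step instead of multiplying. Once this is fast, the regular expressions from the question could be reintroduced on the later steps, where only a few pages remain to be checked.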

Upvotes: 1
