DK5

Reputation: 302

Need advice on Neo4j model / query performance / configuration for optimum performance

I am working on a GraphDB experiment to test whether we can shift our relational data from T-SQL to a graph database (using Neo4j). We are looking to work with large amounts of data which could benefit from being queried as a graph structure. At the moment we are seeing really low query performance even for simple WHERE clauses and aggregation steps. It would be nice to get some advice on how we can achieve better performance, since Neo4j claims to work on billions of nodes. Here is what we have tried.

So, let me describe the data: we have customer data regarding the countries (Geography) and the products (SKUs) they have visited/purchased online. Every time a customer comes to the website, their views/purchases are tracked as part of a unique session id, which changes after 30 minutes. We are trying to accurately calculate the number of visits a person has made to the website by counting the distinct session ids.

We have about 26M rows of data relating to customer visits/purchases made when coming to the website. The data in SQL is in the following format:

| Date     | SessionId | Geography | SKU | OrderId | Revenue | Units |
|----------|-----------|-----------|-----|---------|---------|-------|
| 20160101 | 111       | USA       | A   | null    | 0       | 0     |
| 20160101 | 111       | USA       | B   | 1       | 50      | 1     |
| 20160101 | 222       | UK        | A   | 2       | 10      | 1     |

Problem: We need to calculate the number of visits a customer makes to the site accurately. Visits are calculated as distinct session ids.

Explanation of the visit calculation logic: in the above data, if we look at the visits where a person came to the site looking for SKU "A", the answer is 2: one visit in session 111 and one in session 222. Similarly, if we want the number of visits where a person came looking for SKU "A" or "B", the answer is still 2. Session 111 contains two product views, but it counts as a single visit; adding the visit from session 222 gives a total of 2 visits.

Here's the model we have built: we have one Fact node for each row present in the data. We have made distinct Geography and Product nodes, about 400 and 4,000 respectively. Each of these nodes has relationships to multiple facts. Similarly, we have distinct nodes for Dates.

We created distinct nodes for session ids and order ids too. Both of these point to the facts. So basically we have distinct nodes with the following properties:

1) Geography  {Locale, Country}
2) SKU {SKU, ProductName}
3) Date {Date}
4) Sessions {SessionIds}
5) Orders {OrderIds}
6) Facts {Locale, Country, SKU, ProductName, Date, SessionIds, OrderIds}

The relationship schema is based on matching property values and looks like:

(:Geography)-[:FactGeo]->(:Facts)
(:SKU)-[:FactSKU]->(:Facts)
(:Date)-[:FactDate]->(:Facts)
(:SessionId)-[:FactSessions]->(:Facts)
(:OrderId)-[:FactOrders]->(:Facts)
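
To make the model concrete, a load of roughly this shape could build it (a minimal sketch, not our actual import script; the CSV file name and headers are placeholders, and the Date and Orders nodes are omitted for brevity):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///facts.csv" AS row
// one Facts node per source row, plus MERGEd dimension nodes
CREATE (fct:Facts {Date: row.Date, SessionIds: row.SessionId, OrderIds: row.OrderId,
                   Country: row.Geography, SKU: row.SKU})
MERGE (geo:Geography {Country: row.Geography})
MERGE (sku:SKU {SKU: row.SKU})
MERGE (ssn:Sessions {SessionIds: row.SessionId})
MERGE (geo)-[:FactGeo]->(fct)
MERGE (sku)-[:FactSKU]->(fct)
MERGE (ssn)-[:FactSessions]->(fct);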

Here is a snapshot of the schema (screenshot: The Model's Schema).

Some of you stated that a missing index might be causing the issue, but I have all the indices I need and more. I am assuming that the extra indices on properties I rarely query would not lower performance this significantly.
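
For reference, the indexes these queries rely on would have been created with Neo4j 3.0 statements like the following (shown only to make explicit which properties are indexed; :schema in the browser lists what actually exists):

CREATE INDEX ON :Geography(Country);
CREATE INDEX ON :SKU(SKU);
// only useful if sessions are looked up by id; the DISTINCT count itself does not use it
CREATE INDEX ON :Sessions(SessionId);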

There are a total of 44M nodes, most of them Facts and SessionId nodes, and 131M relationships.

If I query for the distinct visits for around 20 countries and around 20 products, it takes about 44 seconds to get the answer. SQL takes about 47 seconds for the same query (without indexing, while I have indices built in Neo4j). This is not the exceptional improvement I hoped to get from Neo4j, since I think building an index in SQL would give better performance.

The query I wrote was like this:

MATCH (geo:Geography)-[:FactGeo]->(fct:Facts)<-[:FactSKU]-(sku:SKU)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
MATCH (ssn:Sessions)-[:FactSessions]->(fct)
RETURN COUNT(DISTINCT ssn.SessionId);

When I use PROFILE, this results in approx. 69M db hits (PROFILE screenshot: base query which uses Session nodes in the model).

Q1) Is there a way I can improve this model to get a better-performing query? For example, I can change the above model by removing the Session nodes and just counting the SessionIds present on the Fact nodes, as in the query below:

MATCH (geo:Geography)-[:FactGeo]->(fct:Facts)<-[:FactSKU]-(sku:SKU)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT fct.SessionId);

The improvement (fewer db hits, as shown below) comes from avoiding the huge number of nodes and relationships between Facts and Sessions. So it seems I would benefit from keeping SessionId as a property of the Facts nodes instead.

When I use PROFILE, this results in approx. 50M db hits (PROFILE screenshot: query on a model without Session nodes, where SessionId is a property of the Facts).

Also, can someone help me understand the tipping point at which it becomes difficult to scan nodes on the basis of properties as I increase the number of properties the nodes have?

Q2) Is there something wrong with my Neo4j configuration, given that it is taking 44 seconds? I have 114GB of RAM for the Java heap, but no SSD. I have not tweaked the other configuration settings and would like to know whether those could be the bottleneck here, as I was told that Neo4j could run on billions of nodes.

  • My machine's total RAM: 140GB
  • RAM dedicated to the Java heap: 114GB (from what I recollect, there was an almost negligible performance increase as I moved from 64GB to 114GB)
  • Page cache size: 4GB
  • Approximate GraphDB size: 45GB
  • Neo4j version: 3.0.4 Enterprise Edition

Q3) Is there a better way to formulate the query so that it performs better? I tried the following:

MATCH (geo:Geography)-[:FactGeo]->(fct:Facts)
WHERE geo.Country IN ["US", "India", "UK"...]
MATCH (sku:SKU)-[:FactSKU]->(fct)
WHERE sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT fct.SessionId);

But it gives around the same performance and records the same number of db hits as the slightly improved query in Q1.

When I use PROFILE, this results in approx. 50M db hits, exactly the same as the query in Q1 (PROFILE screenshot: similar performance as before).

Q4) If I modify my query from Q3 as below, instead of seeing an improvement I see a major decrease in performance:

MATCH (geo: Geography)
WHERE geo.Country IN ["US", "India", "UK"...]
WITH geo
MATCH (sku: SKU)
WHERE sku.SKU IN ["A","B","C".....]
WITH geo, sku
MATCH (geo)-[:FactGeo]->(fct: Facts)<-[:FactSKU]-(sku)
RETURN COUNT(DISTINCT fct.SessionId);

This appears to create a cross product between the 400 Geography nodes and the 4,000 SKU nodes, and then test each of those 1,600,000 possible combinations for an existing relationship path. Am I understanding this correctly?
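
One variation I could try keeps the staged style but anchors each MATCH on nodes that are already bound, so no cross product is built (a sketch, with the lists elided as above):

MATCH (geo:Geography)
WHERE geo.Country IN ["US", "India", "UK"...]
MATCH (geo)-[:FactGeo]->(fct:Facts)
MATCH (fct)<-[:FactSKU]-(sku:SKU)
WHERE sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT fct.SessionId);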

I know these are long questions and a very long post, but I have tried tirelessly for more than a week to work these things out on my own, and I have shared some of my findings here. Hopefully the community will be able to guide me on some of these queries. Thanks in advance for even reading the post!

EDIT-01: Tore, Inverse and Frank, thanks a lot for trying to help me out. I hope we can figure out the root cause here.

A) I have added some more details, regarding my PROFILE results and also my SCHEMA and Machine/Neo4j config stats.

B) I am considering the model that @InverseFalcon suggested, keeping in mind the points about relationships being a better choice and about limiting the number of relationships.

I am tweaking Inverse's model a bit because I think we can reduce it further. How is this as a model:

(:Session)-[:ON]->(:Date)
(:Session)-[:IN]->(:Geography)
(:Session)-[:Viewed]->(:SKU)
(:Session)-[:Bought]->(:SKU)

OR

(:Session)-[:ON]->(:Date)
(:Session)-[:IN {SKU: "A", HasViewedOrBought: 1}]->(:Geography)

Both of these models have advantages. In the first one I maintain SKUs as distinct nodes and use different relationships to them to distinguish a purchase from a view.
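
With the first model, the visit-count query would look something like this (a sketch; the relationship names and directions are as above):

MATCH (geo:Geography)<-[:IN]-(ssn:Session)-[:Viewed|:Bought]->(sku:SKU)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT ssn);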

In the second one, I completely remove the SKU nodes and push the SKUs into relationship properties. I understand that this will lead to many relationships, but the total count would still be smaller because we also discard all the SKU nodes and their relationships. However, we would have to filter relationships by comparing the SKU strings, which is an intensive operation; perhaps it could be reduced further by keeping only Session and Geography nodes, removing the Date nodes and adding a Date property to those relationships, as below:

(:Session)-[:ON]->(:Date)
(:Session)-[:IN {Date: "2016-01-01", SKU: "A", HasViewedOrBought: 1}]->(:Geography)

But then I would be testing the relationships between the Session and Geography nodes on the basis of two properties, both of which are strings. (Arguably the date can be converted to an integer, but I still see another face-off between the alternative models.)
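
For comparison, against the second model the same question has to inspect the properties of every expanded :IN relationship (and relationship properties cannot be indexed in Neo4j 3.0), something like:

MATCH (ssn:Session)-[r:IN]->(geo:Geography)
WHERE geo.Country IN ["US", "India", "UK"...]
AND r.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT ssn);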

C) @Tore, thanks for explaining and confirming my understanding of Q4. But if the GraphDB does a calculation like that, joining and then comparing every relationship against that join, isn't it actually working the same way an RDBMS would? It fails to use the graph traversals it should easily be able to do by finding direct paths between the two sets of Geography and Product nodes. That seems like a poor implementation to me.

Upvotes: 3

Views: 171

Answers (3)

InverseFalcon

Reputation: 30397

It seems to me that you're trying to do both graph modeling and RDBMS modeling at the same time, and that at least is adding an additional traversal step in your queries.

While I can't say this will result in a major performance improvement, I would consider removing your :Fact nodes, as they contain redundant information that is already captured in your graph. (assuming that session IDs aren't ever reused)

It's just a matter of wiring up your nodes without a :Fact as the central one tying them together. Sessions and Orders are likely going to be your primary nodes.

So your relationships between your nodes might look like this:

(:Session)-[:From]->(:Geography)
(:Session)-[:Visited]->(:Product)
(:Session)-[:On]->(:Date)
(:Session)-[:Ordered]->(:Order)
(:Order)-[:Of]->(:Product)

We're assuming that since the session time window is small enough, we can count a session's date as the same as the order or visitation date from that session. If we need something more specific, we can add a relationship between an :Order and a :Date, and add a date property to a :Visited relationship (assuming we don't want to add a :Visit node as an intermediary between a Session and a Product).

This changes your query to something like:

MATCH (geo:Geography)<-[:From]-(ssn:Session)-[:Ordered]->(:Order)-[:Of]->(sku:Product)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT ssn);

I'm assuming that :Session nodes are unique, with a unique SessionId property, so there should be no need to count the distinct property itself; just use the node.

As Tore noted, indexes and unique constraints are critical here, especially with the size of your data set. Geography.Country, Session.SessionId, Product.SKU, and Order.OrderId should probably all have unique constraints.
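
For example (adjust the labels and properties to whatever your final model uses):

CREATE CONSTRAINT ON (g:Geography) ASSERT g.Country IS UNIQUE;
CREATE CONSTRAINT ON (s:Session) ASSERT s.SessionId IS UNIQUE;
CREATE CONSTRAINT ON (p:Product) ASSERT p.SKU IS UNIQUE;
CREATE CONSTRAINT ON (o:Order) ASSERT o.OrderId IS UNIQUE;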

Use PROFILE to see where your queries may be running into problems.

And all that said, your use case here probably will not see a significant improvement over an RDBMS, as this kind of data both models and queries well in a relational db. Are there any questions about your data that you are either unable to answer, or unable to answer quickly, in your current db?

EDIT

In response to your edit, it is also helpful to expand (show more detail on) the operations in your PROFILE so you can see not just the operation and the db hits, but also what aspect of your query the operation concerns.

Based upon what we find when we expand those operations, we're likely to see an opportunity to improve the query performance, as I'm guessing there's a massive difference in numbers between those who have bought the products in question, and the total sessions within a country.

One possible area we could improve on is suggesting which index to use in the query, as traversing from the product to the sessions of users who bought it, then to the countries associated with the sessions, ought to be more performant than trying to match from sessions of all users from the given countries.
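
Something along these lines, for instance (a sketch; whether the planner accepts the hint depends on the indexes that actually exist):

MATCH (geo:Geography)<-[:From]-(ssn:Session)-[:Ordered]->(:Order)-[:Of]->(sku:Product)
USING INDEX sku:Product(SKU)
WHERE geo.Country IN ["US", "India", "UK"...]
AND sku.SKU IN ["A","B","C".....]
RETURN COUNT(DISTINCT ssn);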

It's important to note that the benefits of Neo4j shine when you are querying across smaller subgraphs of data, not the whole data set or huge chunks of it. The subgraphs you are looking at in your example queries are still quite large, covering the purchase histories of users across entire countries. Those kinds of queries are best done with an RDBMS, and at that scale you are doing millions of graph traversals, which aren't trivial; to find connections between Geography and Product nodes it still must perform those traversals and use set operations to keep only those that connect. I would imagine, though, when asking queries about data at this scale (products bought by users across many different countries), that this is more of an analysis operation, not an operation servicing a user in realtime, so I'm wondering if performance concerns are critical for these kinds of queries.

You would start to see performance improvements as your queried subgraph shrinks. You may start to see this if your queries narrowed down the countries queried.

Even better if you're asking about an individual user's purchase history, as then the queried subgraph is local to a user. But you're already modeling that perfectly well in an RDBMS, since all the data you need is in indexed rows of a single table.

Remember that Neo4j's strength is in doing traversals rather than joins, but in your current RDBMS data model you aren't doing any joins; everything you need is in indexed rows. It seems to me that yours is a use case where the queries you plan to run span huge subgraphs, the data model is actually more complicated as a graph than in an RDBMS, and you're not getting much out of that added complexity with the queries you've provided.

When you're considering graph databases, what should really drive your decision is the queries you plan on making on it, and in general what questions you might ask of the relationships in that data, and if those questions are hard to answer in your current db. It seems to me that if your example queries are representative of those you plan to make regularly, your current db handles this just fine. If you were asking questions that were harder to answer in your current solution, and were more relationship-based (such as product suggestions for a user based upon what other users who have bought or viewed those products have bought), a graph db solution would make more sense, and could either be used for realtime queries, or have query results cached and renewed periodically.
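
To illustrate, a suggestion query of that kind might look like this in the session-centric model above (purely a sketch; sessions stand in for users here since there is no :User node, and SessionId 111 is just the example id from your sample data):

MATCH (s:Session {SessionId: 111})-[:Visited]->(p:Product)
      <-[:Visited]-(other:Session)-[:Ordered]->(:Order)-[:Of]->(rec:Product)
WHERE rec <> p
RETURN rec.SKU, COUNT(DISTINCT other) AS strength
ORDER BY strength DESC
LIMIT 10;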

PERFORMANCE IMPROVEMENT EDIT

It still seems to me that, since you have these :Facts nodes, you really don't have to make many traversals. But then that's exactly like an RDBMS, so with these kinds of queries an RDBMS performs better.

MATCH (sku: SKU)-[:FactSKU]->(fct: Facts)
WHERE sku.SKU IN ["A","B","C".....]
AND fct.Country IN ["US", "India", "UK"...]
RETURN COUNT(DISTINCT fct.SessionId)

In this query (assuming sku.SKU is unique or indexed), you would only be using the graph to optimize finding the :Facts related to the products (since you directly get back all relevant :Facts instead of having to filter on the product). At that point, since the Country field is already present on the :Facts nodes, you have everything you need to filter, so just do it there.

You may want to compare this to a purely relational query here for fun:

MATCH (fct: Facts)
WHERE fct.SKU IN ["A","B","C".....]
AND fct.Country IN ["US", "India", "UK"...]
RETURN COUNT(DISTINCT fct.SessionId)

Upvotes: 3

Frank Pavageau

Reputation: 11705

The answers from Tore and InverseFalcon already contain a lot of good points, but I'll add a few more points to consider.

Q1

Traversing nodes and relationships is cheap, but not traversing them is even cheaper, especially when it's repeated! The Fact node duplicates information which really seems to belong to the Session node: take the Geography, for example; it feels like there should only be one per session (unless knowing that a user switched locale is important).

Q2

114 GB is a lot of heap for a JVM. Is this configuration the result of actual measurements, or just throwing more memory at the problem? A huge heap has several drawbacks: garbage collection pauses get longer and harder to predict, and every gigabyte reserved for the heap is a gigabyte the OS cannot use for the page cache.

Unless you see requests consuming large quantities of heap, you should limit the heap, leaving more memory for the page cache.

Neo4j can hold billions of nodes and relationships, but the more of them a single query touches, the longer it will take. Neo4j can be more efficient than an RDBMS if you only touch a limited, local sub-graph, even if it's part of a really large one, because once the starting nodes have been found, traversing relationships to other nodes only requires chasing pointers (as it's usually presented) instead of using indices (index-free adjacency). See this SO question for example.


EDIT 1

Regarding the memory, I'd really try a different setup, with less heap for the JVM but a larger page cache, given the size of the database: instead of 114 GB / 4 GB, perhaps 64 GB / 48 GB.
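
With the 3.0.x packaging that translates into something like this (the heap settings may live in neo4j-wrapper.conf rather than neo4j.conf depending on the minor version; the values are only a starting point to measure against):

# Neo4j 3.0.x memory settings (check neo4j.conf / neo4j-wrapper.conf for your packaging)
dbms.memory.heap.initial_size=64g
dbms.memory.heap.max_size=64g
# leave room to cache most of the ~45 GB store off-heap
dbms.memory.pagecache.size=48g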

About the model, given the volume, I don't think moving everything to properties on the relationships would help, quite the contrary: with nodes, you only do the lookup once and then follow relationships, whereas with properties you have to compare the property values for each and every relationship. In a previous project I actually gained a lot of traversal performance by replacing systematic property comparisons with a node lookup followed by a simple comparison of ids. You would also create multiple relationships to Geography where you functionally had a single one before, which can change the way some Geography-related queries behave (or are even written).
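
To make that concrete with the labels from your edit, the difference is roughly between these two shapes (just a sketch):

// property comparison repeated on every traversed relationship
MATCH (ssn:Session)-[r:IN]->(:Geography)
WHERE r.SKU = "A"
RETURN COUNT(DISTINCT ssn);

// node lookup done once (index/constraint on SKU), then plain traversals
MATCH (sku:SKU {SKU: "A"})<-[:Viewed]-(ssn:Session)
RETURN COUNT(DISTINCT ssn);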

And about Q4, the database is really doing what you've told it to: using WITH, you create "barriers" which force the shape of the data at that point. Since you ask the engine to create a cartesian product of the Geography and SKU nodes, that's what it does, even if it's only used to try and find related Facts. Cypher is mostly declarative, but there are still a few ways to shape how the computation happens, with different outcomes in terms of performance:

  • WITH
  • USING INDEX
  • any(...) vs size(filter(...)) > 0
  • etc.

Upvotes: 3

Tore Eschliman

Reputation: 2507

tl;dr: You are almost certainly missing an index on a property. Double-check your schema for indexes on geo.Country, sku.SKU, and Sessions.SessionIds.

Q1) There is only one tipping point: If you reference any unindexed properties at all in a query, your query will slow down massively. Unindexed properties are for data storage, not querying, unless you can seriously narrow the number of nodes being checked first.

Q2) Your bottleneck is almost certainly a missing index. Double check your :schema to make sure that every property you access in the query, on any label, is indexed.

Q3) Those two queries are almost exactly equivalent, yes. Unless you need extremely responsive individual results, any efficiency gain from one over the other is going to be negligible compared to the rest of the query.

Q4) Your understanding is basically correct. This query creates a row for each matching Geo, then for each of those rows creates a row for each matching Sku, even when there is no potential connection between the two. It then filters out all of the rows where a connection can't exist. Matching the pattern including the intermediate Fact, as in the earlier queries, ensures that those rows are never created or tracked.

EDIT: See InverseFalcon's answer for a better perspective on how to model the data in a graph, and whether it's worth it in the first place.

EDIT 2: After seeing your full query, I think the points Frank and Inverse made are even more relevant. If you needed to frequently access the most current data for a single Geo, a graph query would likely be more performant, but you're pulling a report that touches a very large portion of your graph, and your data is heavily structured, so it's going to be hard to beat a regular relational database.

Upvotes: 3
