Reputation: 31
I've been wrangling with Neo4J for the last few weeks, trying to resolve some extremely challenging performance problems. At this point, I need some additional help because I can't determine how to move forward.
I have a graph with a total of approx 12.5 Million nodes and 64 Million relationships. The purpose of the graph is going to be analyzing suspicious financial behavior, so it is customers, accounts, transactions, etc.
Here is an example of the performance challenge:
This query for total nodes takes 96,064ms to complete, which is extremely long.
neo4j-sh (?)$ MATCH (n) RETURN count(n);
+----------+
| count(n) |
+----------+
| 12519940 |
+----------+
1 row
96064 ms
The query for total relationships takes 919,449ms to complete, which seems silly.
neo4j-sh (?)$ MATCH ()-[r]-() return count(r);
+----------+
| count(r) |
+----------+
| 64062508 |
+----------+
1 row
919449 ms
I have 6.6M Transaction Nodes. When I attempt to search for transactions that have an amount above $8,000, the query takes 653,637ms also way too long.
neo4j-sh (?)$ MATCH (t:Transaction) WHERE t.amount > 8000.00 return count(t);
+----------+
| count(t) |
+----------+
| 10696 |
+----------+
1 row
653637 ms
Relevant Schema
ON :Transaction(baseamount) ONLINE
ON :Transaction(type) ONLINE
ON :Transaction(amount) ONLINE
ON :Transaction(currency) ONLINE
ON :Transaction(basecurrency) ONLINE
ON :Transaction(transactionid) ONLINE (for uniqueness constraint)
Profile of Query:
neo4j-sh (?)$ PROFILE MATCH (t:Transaction) WHERE t.amount > 8000.00 return count(t);
+----------+
| count(t) |
+----------+
| 10696 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+NodeByLabel
+------------------+---------+----------+-------------+------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+---------+----------+-------------+------------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(t) |
| EagerAggregation | 1 | 0 | | |
| Filter | 10696 | 13216382 | | Property(t,amount(62)) > { AUTODOUBLE0} |
| NodeByLabel | 6608191 | 6608192 | t, t | :Transaction |
+------------------+---------+----------+-------------+------------------------------------------+
I am running these in the neo4j shell.
The performance challenges here are starting to create substantial doubt about whether I can even use Neo4J, and seem opposite of the potential the platform offers.
I am fully admit that I may have misconfigured something (I'm relatively new to Neo4J), so guidance on what to fix or what to look at is much appreciated.
Here are details of my setup:
System: Linux, Ubuntu, 16GB RAM, 3.5 i5 Proc, 256GB SSD HD
CPU
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
stepping : 3
microcode : 0x12
cpu MHz : 4230.625
cache size : 6144 KB
Memory
$ cat /proc/meminfo
MemTotal: 16115020 kB
MemFree: 224856 kB
MemAvailable: 8807160 kB
Buffers: 124356 kB
Cached: 8429964 kB
SwapCached: 8388 kB
Disk
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/data1--vg-root 219G 32G 177G 16% /
Neo4J.properties
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.relationshipgroupstore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=50M
neostore.propertystore.db.index.keys.mapped_memory=200M
relationship_auto_indexing=true
Neo4J-Wrapper.properties
wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties
#********************************************************************
# JVM Parameters
#********************************************************************
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
# Uncomment the following lines to enable garbage collection logging
wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
wrapper.java.additional=-XX:+PrintGCDetails
wrapper.java.additional=-XX:+PrintGCDateStamps
wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional=-XX:+PrintPromotionFailure
wrapper.java.additional=-XX:+PrintTenuringDistribution
# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=6144
Other:
Changed the open file settings for Linux to 40k
I am not running anything else on this machine, no X Windows, no other DB server. Here is a snippet of top while running a query:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15785 neo4j 20 0 12.192g 8.964g 2.475g S 100.2 58.3 227:50.98 java
1 root 20 0 33464 2132 1140 S 0.0 0.0 0:02.36 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
The total file size in the graph.db directory is:
data/graph.db$ du --max-depth=1 -h
1.9G ./schema
36K ./index
26G .
Data loading was extremely hit or miss. Some merges would take less than 60 seconds (Even for ~200 to 300K inserts), while some merges would last for over 3 hours (11,898,514ms for a CSV file with 189,999 rows merging on one date)
I get constant GC thread blocking:
2015-03-27 14:56:26.347+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 15422ms.
2015-03-27 14:56:39.011+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 12363ms.
2015-03-27 14:56:57.533+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 13969ms.
2015-03-27 14:57:17.345+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 14657ms.
2015-03-27 14:57:29.955+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 12309ms.
2015-03-27 14:58:14.311+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 1928ms.
Please let me know if I should add anything else that would be salient to the discussion
Update 1
Thank you very much for your help, I just moved so I was delayed in responding.
Size of Neostore Files:
/data/graph.db$ ls -lah neostore.*
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.id
-rw-rw-r-- 1 neo4j neo4j 110 Apr 2 13:03 neostore.labeltokenstore.db
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.labeltokenstore.db.id
-rw-rw-r-- 1 neo4j neo4j 874 Apr 2 13:03 neostore.labeltokenstore.db.names
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.labeltokenstore.db.names.id
-rw-rw-r-- 1 neo4j neo4j 200M Apr 2 13:03 neostore.nodestore.db
-rw-rw-r-- 1 neo4j neo4j 41 Apr 2 13:03 neostore.nodestore.db.id
-rw-rw-r-- 1 neo4j neo4j 68 Apr 2 13:03 neostore.nodestore.db.labels
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.nodestore.db.labels.id
-rw-rw-r-- 1 neo4j neo4j 2.8G Apr 2 13:03 neostore.propertystore.db
-rw-rw-r-- 1 neo4j neo4j 128 Apr 2 13:03 neostore.propertystore.db.arrays
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.arrays.id
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.id
-rw-rw-r-- 1 neo4j neo4j 720 Apr 2 13:03 neostore.propertystore.db.index
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.index.id
-rw-rw-r-- 1 neo4j neo4j 3.1K Apr 2 13:03 neostore.propertystore.db.index.keys
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.index.keys.id
-rw-rw-r-- 1 neo4j neo4j 1.7K Apr 2 13:03 neostore.propertystore.db.strings
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.propertystore.db.strings.id
-rw-rw-r-- 1 neo4j neo4j 47M Apr 2 13:03 neostore.relationshipgroupstore.db
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.relationshipgroupstore.db.id
-rw-rw-r-- 1 neo4j neo4j 1.1G Apr 2 13:03 neostore.relationshipstore.db
-rw-rw-r-- 1 neo4j neo4j 1.6M Apr 2 13:03 neostore.relationshipstore.db.id
-rw-rw-r-- 1 neo4j neo4j 165 Apr 2 13:03 neostore.relationshiptypestore.db
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.relationshiptypestore.db.id
-rw-rw-r-- 1 neo4j neo4j 1.3K Apr 2 13:03 neostore.relationshiptypestore.db.names
-rw-rw-r-- 1 neo4j neo4j 9 Apr 2 13:03 neostore.relationshiptypestore.db.names.id
-rw-rw-r-- 1 neo4j neo4j 3.5K Apr 2 13:03 neostore.schemastore.db
-rw-rw-r-- 1 neo4j neo4j 25 Apr 2 13:03 neostore.schemastore.db.id
I read that mapped memory settings are replaced by another cache, and I have commented out those settings.
Java Profiler
JvmTop 0.8.0 alpha - 16:12:59, amd64, 4 cpus, Linux 3.16.0-33, load avg 0.30
http://code.google.com/p/jvmtop
Profiling PID 4260: org.neo4j.server.Bootstrapper
68.67% ( 14.01s) org.neo4j.kernel.impl.nioneo.store.StoreFileChannel.read()
18.73% ( 3.82s) org.neo4j.kernel.impl.nioneo.store.StoreFailureException.<init>()
2.86% ( 0.58s) org.neo4j.kernel.impl.cache.ReferenceCache.put()
1.11% ( 0.23s) org.neo4j.helpers.Counter.inc()
0.87% ( 0.18s) org.neo4j.kernel.impl.cache.ReferenceCache.get()
0.65% ( 0.13s) org.neo4j.cypher.internal.compiler.v2_1.parser.Literals$class.PropertyKeyName()
0.63% ( 0.13s) org.parboiled.scala.package$.getCurrentRuleMethod()
0.62% ( 0.13s) scala.collection.mutable.OpenHashMap.<init>()
0.62% ( 0.13s) scala.collection.mutable.AbstractSeq.<init>()
0.62% ( 0.13s) org.neo4j.kernel.impl.cache.AutoLoadingCache.get()
0.61% ( 0.13s) scala.collection.TraversableLike$$anonfun$map$1.apply()
0.61% ( 0.12s) org.neo4j.kernel.impl.transaction.TxManager.assertTmOk()
0.61% ( 0.12s) org.neo4j.cypher.internal.compiler.v2_1.commands.EntityProducerFactory.<init>()
0.61% ( 0.12s) scala.collection.AbstractTraversable.<init>()
0.61% ( 0.12s) scala.collection.immutable.List.toStream()
0.60% ( 0.12s) org.neo4j.kernel.impl.nioneo.store.NodeStore.getRecord()
0.57% ( 0.12s) org.neo4j.kernel.impl.transaction.TxManager.getTransaction()
0.37% ( 0.08s) org.parboiled.scala.Parser$class.rule()
0.06% ( 0.01s) scala.util.DynamicVariable.value()
Upvotes: 3
Views: 354
Reputation: 39926
Unfortunately the schema indexes (aka those created using CREATE INDEX ON :Label(property)
) do not yet support larger than/smaller than conditions. Therefore Neo4j falls back to scan all nodes with the given label and filter on their properties. This is of course expensive.
I do see two different approaches to tackle this:
1) If your condition does always have a pre-defined maximum granularity e.g. 10s of USDs, you can build up an "amount-tree" similar to a time-tree (see http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html).
2) if you don't know the granularity upfront the other option is to setup a manual or auto index for the amount property, see http://neo4j.com/docs/stable/indexing.html. The most easy thing is probably using auto index. In neo4j.properties
set the following options:
node_auto_indexing=true
node_keys_indexable=amount
Note that this will not automatically add all existing transaction into that index, it just puts those in the index that have been written to since auto indexing is enabled.
You can do a explicit range query on the auto index using
MATCH t=node:node_auto_index("amount:[6000 TO 999999999]")
RETURN count(t)
Upvotes: 2