Neo4j Performance Challenge - How to Improve?

Question

I've been wrangling with Neo4J for the last few weeks, trying to resolve some extremely challenging performance problems. At this point, I need some additional help because I can't determine how to move forward.

I have a graph with a total of approx 12.5 Million nodes and 64 Million relationships. The purpose of the graph is going to be analyzing suspicious financial behavior, so it is customers, accounts, transactions, etc.

Here is an example of the performance challenge:

This query for total nodes takes 96,064ms to complete, which is extremely long.

neo4j-sh (?)$ MATCH (n) RETURN count(n);
+----------+
| count(n) |
+----------+
| 12519940 |
+----------+
1 row
96064 ms

The query for total relationships takes 919,449ms to complete, which seems silly.

neo4j-sh (?)$ MATCH ()-[r]-() return count(r);
+----------+
| count(r) |
+----------+
| 64062508 |
+----------+
1 row
919449 ms

I have 6.6M Transaction Nodes. When I attempt to search for transactions that have an amount above $8,000, the query takes 653,637ms also way too long.

neo4j-sh (?)$ MATCH (t:Transaction) WHERE t.amount > 8000.00 return count(t);        
+----------+
| count(t) |
+----------+
| 10696    |
+----------+
1 row
653637 ms

Relevant Schema

 ON :Transaction(baseamount)    ONLINE                             
 ON :Transaction(type)          ONLINE                             
 ON :Transaction(amount)        ONLINE                             
 ON :Transaction(currency)      ONLINE                             
 ON :Transaction(basecurrency)  ONLINE                             
 ON :Transaction(transactionid) ONLINE (for uniqueness constraint)

Profile of Query:

neo4j-sh (?)$ PROFILE MATCH (t:Transaction) WHERE t.amount > 8000.00 return count(t);  
+----------+
| count(t) |
+----------+
| 10696    |
+----------+
1 row

ColumnFilter
  |
  +EagerAggregation
    |
    +Filter
      |
      +NodeByLabel

+------------------+---------+----------+-------------+------------------------------------------+
|         Operator |    Rows |   DbHits | Identifiers |                                    Other |
+------------------+---------+----------+-------------+------------------------------------------+
|     ColumnFilter |       1 |        0 |             |                    keep columns count(t) |
| EagerAggregation |       1 |        0 |             |                                          |
|           Filter |   10696 | 13216382 |             | Property(t,amount(62)) > {  AUTODOUBLE0} |
|      NodeByLabel | 6608191 |  6608192 |        t, t |                             :Transaction |
+------------------+---------+----------+-------------+------------------------------------------+

I am running these in the neo4j shell.
The performance challenges here are starting to create substantial doubt about whether I can even use Neo4J, and seem opposite of the potential the platform offers.
I am fully admit that I may have misconfigured something (I'm relatively new to Neo4J), so guidance on what to fix or what to look at is much appreciated.

Here are details of my setup:

System: Linux, Ubuntu, 16GB RAM, 3.5 i5 Proc, 256GB SSD HD

CPU

$ cat /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 60
model name  : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
stepping    : 3
microcode   : 0x12
cpu MHz     : 4230.625
cache size  : 6144 KB

Memory

$ cat /proc/meminfo
MemTotal:       16115020 kB
MemFree:          224856 kB
MemAvailable:    8807160 kB
Buffers:          124356 kB
Cached:          8429964 kB
SwapCached:         8388 kB

Disk

$ df -h
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/data1--vg-root  219G   32G  177G  16% /

Neo4J.properties

neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.relationshipgroupstore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=50M
neostore.propertystore.db.index.keys.mapped_memory=200M
relationship_auto_indexing=true

Neo4J-Wrapper.properties

wrapper.java.additional=-Dorg.neo4j.server.properties=conf/neo4j-server.properties
wrapper.java.additional=-Djava.util.logging.config.file=conf/logging.properties
wrapper.java.additional=-Dlog4j.configuration=file:conf/log4j.properties

#********************************************************************
# JVM Parameters
#********************************************************************

wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow

# Uncomment the following lines to enable garbage collection logging
wrapper.java.additional=-Xloggc:data/log/neo4j-gc.log
wrapper.java.additional=-XX:+PrintGCDetails
wrapper.java.additional=-XX:+PrintGCDateStamps
wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional=-XX:+PrintPromotionFailure
wrapper.java.additional=-XX:+PrintTenuringDistribution

# Java Heap Size: by default the Java heap size is dynamically
# calculated based on available system resources.
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=6144

Other:

Changed the open file settings for Linux to 40k

I am not running anything else on this machine, no X Windows, no other DB server. Here is a snippet of top while running a query:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                
15785 neo4j     20   0 12.192g 8.964g 2.475g S 100.2 58.3 227:50.98 java                                                                                                                   
1 root      20   0   33464   2132   1140 S   0.0  0.0   0:02.36 init                                                                                                                   
2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd

The total file size in the graph.db directory is:

data/graph.db$ du --max-depth=1 -h
1.9G    ./schema
36K ./index
26G .

Data loading was extremely hit or miss. Some merges would take less than 60 seconds (Even for ~200 to 300K inserts), while some merges would last for over 3 hours (11,898,514ms for a CSV file with 189,999 rows merging on one date)

I get constant GC thread blocking:

2015-03-27 14:56:26.347+0000 WARN  [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 15422ms.
2015-03-27 14:56:39.011+0000 WARN  [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 12363ms.
2015-03-27 14:56:57.533+0000 WARN  [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 13969ms.
2015-03-27 14:57:17.345+0000 WARN  [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 14657ms.
2015-03-27 14:57:29.955+0000 WARN  [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 12309ms.
2015-03-27 14:58:14.311+0000 WARN  [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for 1928ms.

Please let me know if I should add anything else that would be salient to the discussion

Update 1

Thank you very much for your help, I just moved so I was delayed in responding.

Size of Neostore Files:

/data/graph.db$ ls -lah neostore.*
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.id
-rw-rw-r-- 1 neo4j neo4j  110 Apr  2 13:03 neostore.labeltokenstore.db
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.labeltokenstore.db.id
-rw-rw-r-- 1 neo4j neo4j  874 Apr  2 13:03 neostore.labeltokenstore.db.names
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.labeltokenstore.db.names.id
-rw-rw-r-- 1 neo4j neo4j 200M Apr  2 13:03 neostore.nodestore.db
-rw-rw-r-- 1 neo4j neo4j   41 Apr  2 13:03 neostore.nodestore.db.id
-rw-rw-r-- 1 neo4j neo4j   68 Apr  2 13:03 neostore.nodestore.db.labels
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.nodestore.db.labels.id
-rw-rw-r-- 1 neo4j neo4j 2.8G Apr  2 13:03 neostore.propertystore.db
-rw-rw-r-- 1 neo4j neo4j  128 Apr  2 13:03 neostore.propertystore.db.arrays
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.propertystore.db.arrays.id
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.propertystore.db.id
-rw-rw-r-- 1 neo4j neo4j  720 Apr  2 13:03 neostore.propertystore.db.index
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.propertystore.db.index.id
-rw-rw-r-- 1 neo4j neo4j 3.1K Apr  2 13:03 neostore.propertystore.db.index.keys
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.propertystore.db.index.keys.id
-rw-rw-r-- 1 neo4j neo4j 1.7K Apr  2 13:03 neostore.propertystore.db.strings
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.propertystore.db.strings.id
-rw-rw-r-- 1 neo4j neo4j  47M Apr  2 13:03 neostore.relationshipgroupstore.db
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.relationshipgroupstore.db.id
-rw-rw-r-- 1 neo4j neo4j 1.1G Apr  2 13:03 neostore.relationshipstore.db
-rw-rw-r-- 1 neo4j neo4j 1.6M Apr  2 13:03 neostore.relationshipstore.db.id
-rw-rw-r-- 1 neo4j neo4j  165 Apr  2 13:03 neostore.relationshiptypestore.db
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.relationshiptypestore.db.id
-rw-rw-r-- 1 neo4j neo4j 1.3K Apr  2 13:03 neostore.relationshiptypestore.db.names
-rw-rw-r-- 1 neo4j neo4j    9 Apr  2 13:03 neostore.relationshiptypestore.db.names.id
-rw-rw-r-- 1 neo4j neo4j 3.5K Apr  2 13:03 neostore.schemastore.db
-rw-rw-r-- 1 neo4j neo4j   25 Apr  2 13:03 neostore.schemastore.db.id

I read that mapped memory settings are replaced by another cache, and I have commented out those settings.

Java Profiler

   JvmTop 0.8.0 alpha - 16:12:59,  amd64,  4 cpus, Linux 3.16.0-33, load avg 0.30
   http://code.google.com/p/jvmtop

   Profiling PID 4260:            org.neo4j.server.Bootstrapper 

    68.67% (    14.01s) org.neo4j.kernel.impl.nioneo.store.StoreFileChannel.read()
    18.73% (     3.82s) org.neo4j.kernel.impl.nioneo.store.StoreFailureException.()
     2.86% (     0.58s) org.neo4j.kernel.impl.cache.ReferenceCache.put()
     1.11% (     0.23s) org.neo4j.helpers.Counter.inc()
     0.87% (     0.18s) org.neo4j.kernel.impl.cache.ReferenceCache.get()
     0.65% (     0.13s) org.neo4j.cypher.internal.compiler.v2_1.parser.Literals$class.PropertyKeyName()
     0.63% (     0.13s) org.parboiled.scala.package$.getCurrentRuleMethod()
     0.62% (     0.13s) scala.collection.mutable.OpenHashMap.()
     0.62% (     0.13s) scala.collection.mutable.AbstractSeq.()
     0.62% (     0.13s) org.neo4j.kernel.impl.cache.AutoLoadingCache.get()
     0.61% (     0.13s) scala.collection.TraversableLike$$anonfun$map$1.apply()
     0.61% (     0.12s) org.neo4j.kernel.impl.transaction.TxManager.assertTmOk()
     0.61% (     0.12s) org.neo4j.cypher.internal.compiler.v2_1.commands.EntityProducerFactory.()
     0.61% (     0.12s) scala.collection.AbstractTraversable.()
     0.61% (     0.12s) scala.collection.immutable.List.toStream()
     0.60% (     0.12s) org.neo4j.kernel.impl.nioneo.store.NodeStore.getRecord()
     0.57% (     0.12s) org.neo4j.kernel.impl.transaction.TxManager.getTransaction()
     0.37% (     0.08s) org.parboiled.scala.Parser$class.rule()
     0.06% (     0.01s) scala.util.DynamicVariable.value()

Neo4j Performance Challenge - How to Improve?

Answers (1)

Related Questions