Reputation: 11
I am trying to use Neo4j in my application and I am facing a few critical problems in my experiments. The problem statement is divided into the following parts.
BACKGROUND:
The use case is ingesting data from the internet in real time, at a scale of billions of nodes/relationships; the relationships are just person-to-person, with several properties each.
CONFIGURATION:
Machine configuration:
cpu: 24 processors, Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
memory: 165 203 696 kB
jdk: java version "1.7.0_67", Java(TM) SE Runtime Environment (build 1.7.0_67-b01), Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
Linux version: 2.6.32-431.el6.x86_64
OS: CentOS release 6.5
Neo4j configuration:
enterprise version: 2.1.5
jvm heap: default
objects cache:
neostore.nodestore.db.mapped_memory=512M
neostore.relationshipstore.db.mapped_memory=6G
neostore.propertystore.db.mapped_memory=5G
neostore.propertystore.db.strings.mapped_memory=1G
neostore.propertystore.db.arrays.mapped_memory=1G
client configuration:
py2neo, version 1.6.4
CODE IN CLIENT:
CYPHER_WEIGHT_COMPUTE='r.weight=r.weight+r.weight*EXP((TIMESTAMP()-r.update_time)/(r.half_life*1.0))'
# Initialisation: create uniqueness constraints on id for each label
self.query = neo.CypherQuery(self.graph_db,
    'CREATE CONSTRAINT ON (pn:UID) ASSERT pn.id IS UNIQUE')
self.query.execute()
self.query = neo.CypherQuery(self.graph_db,
    'CREATE CONSTRAINT ON (pm:GID) ASSERT pm.id IS UNIQUE')
self.query.execute()
# Cypher clause (the self.create_rels template)
MERGE (first:{TYPE1} {{id:'{val1}'}})
MERGE (second:{TYPE2} {{id:'{val2}'}})
MERGE (first)-[r:{RTYPE}]->(second)
  ON CREATE SET r.weight={weight_set}
  ON MATCH SET {weight_compute}
WITH r
SET r.half_life={half_life},
    r.update_time=TIMESTAMP(),
    r.threshold={threshold}
WITH r
WHERE r.weight<r.threshold
DELETE r
self.query = neo.CypherQuery(self.graph_db, self.create_rels.format(
    TYPE1=entity1[0], val1=entity1[1],
    TYPE2=entity2[0], val2=entity2[1],
    RTYPE=rel_type, weight_set=weight_set,
    weight_compute=CYPHER_WEIGHT_COMPUTE,
    half_life=half_life, threshold=threshold))
self.query.execute()
RESULT:
When I use 24 Python threads with py2neo to write 59229 nodes, 236048 relationships and 531325 properties, the average time is about 1316 seconds. This cannot meet my real-time requirement; it would work for me if the time dropped to about 150 seconds. The time per node/relationship also increases as the data scale grows.
QUESTIONS:
Is there any way to improve write performance other than optimising the Cypher clause and using batch insertion? I have tried configuring different sizes for the JVM heap and the object cache, and found that it had little effect on write performance. I think the reason may be the small scale of my test data (thousands to tens of thousands of nodes/relationships); the effect may be significant at a much larger scale (tens of millions, billions).
How many nodes per second (nps) or relationships per second (rps) can Neo4j reach for reads and writes, in your experience, at a scale of billions of nodes/relationships?
I also found that Neo4j cannot shard automatically, but there is a section about cache-based sharding in the documentation. If I use cache-based sharding with HAProxy, how are the relationships between nodes that have been sharded to different machines maintained? That is to say, how do I make sure the relationships are not broken by the sharding?
Can master/slave mode be used in both the community and enterprise versions?
Thanks in advance.
Regards
Upvotes: 1
Views: 866
Reputation: 33145
Must you do these as different requests? I recommend you use the transactional cypher endpoint: http://nigelsmall.com/py2neo/1.6/cypher/#id2
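A minimal sketch of that approach, assuming the py2neo 1.6 cypher Session API documented at the link above (the server URL, batch size and the write_batch helper name are placeholders, and the Cypher simply mirrors the template from the question):

# Sketch: send many MERGE statements in one transactional request with
# py2neo 1.6's cypher Session, instead of one HTTP call per relationship.
from py2neo import cypher

session = cypher.Session("http://localhost:7474")  # placeholder URL

STATEMENT = (
    "MERGE (first:%s {id:{val1}}) "
    "MERGE (second:%s {id:{val2}}) "
    "MERGE (first)-[r:%s]->(second) "
    "  ON CREATE SET r.weight={weight_set} "
    "  ON MATCH SET r.weight=r.weight+r.weight*"
    "EXP((TIMESTAMP()-r.update_time)/(r.half_life*1.0)) "
    "WITH r "
    "SET r.half_life={half_life}, r.update_time=TIMESTAMP(), "
    "    r.threshold={threshold} "
    "WITH r WHERE r.weight<r.threshold DELETE r"
)

def write_batch(pairs, rel_type, weight_set, half_life, threshold):
    """pairs: iterable of ((label1, id1), (label2, id2)) tuples (placeholder helper)."""
    tx = session.create_transaction()
    for (label1, id1), (label2, id2) in pairs:
        # Labels and relationship types cannot be query parameters in Cypher,
        # so they are interpolated; the property values are passed as parameters,
        # which also lets the server cache the query plan.
        tx.append(STATEMENT % (label1, label2, rel_type),
                  {"val1": id1, "val2": id2, "weight_set": weight_set,
                   "half_life": half_life, "threshold": threshold})
    tx.commit()  # one round trip for the whole batch

Appending a few hundred to a few thousand statements per commit is usually a better starting point than one HTTP request per relationship.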
It depends on the query, the API used, and how you count reads and writes. With the transactional HTTP API, I've managed to get 30k Cypher CREATE statements per second, each with two nodes and a rel. MERGE is a fair bit slower, and you need to make sure you're using the constraint index.
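To check that, you can profile the MERGE in the Neo4j shell or browser; the id value below is just a placeholder and the exact operator names vary between Neo4j versions, but you want to see an index/constraint seek on :UID(id) rather than a scan over all :UID nodes (note that PROFILE actually executes the statement):

// PROFILE executes the statement and returns the query plan
PROFILE MERGE (pn:UID {id:'example-id'}) RETURN pn;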
The idea is to keep a subset of the data cached by routing a subset of users (or any subset you can define) to particular cluster nodes for their queries. If the relationship a query needs to follow isn't cached, it will end up reading from disk. All data must be on the disks of all members of the cluster.
I'm not certain, but I'm pretty sure all the clustering features come with enterprise.
Upvotes: 5