Reputation: 21
I am testing TitanDB + Cassandra at the moment. My graph schema looks like this:
VERTEX: USER(userId), IP(ip), SESSION_ID(sessionId), DEVICE(deviceId)
EDGE: USER->IP, USER->SESSION_ID, USER->DEVICE
DATA SIZE: vertices: 100 million, edges: 1 billion
INDEX: vertex-centric indexes on every edge label; graph indexes for userId, ip, sessionId, and deviceId.
PARTITIONING: vertex partitioning enabled for IP, DEVICE, and SESSION_ID; 32 partitions in total.
CASSANDRA HOSTS: 24 x AWS EC2 i2.2xlarge. Currently every host holds about 30 GB of data.
USE CASE: given a userId and an edge label, find all related users, i.e. the other users that share the edge's target vertex. For example:
g.V().has(T.label, 'USER').has('USER_ID', '12345').out('USER_IP').in().valueMap();
But this kind of query is pretty slow, sometimes taking hundreds of seconds. One user can have many related IPs (hundreds), and from those IPs the traversal can in turn reach many USERs (thousands).
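To make the fan-out concrete, here is a rough illustration (the counts are only assumptions based on the numbers above):
// illustrative only: assuming ~300 IPs per user and ~30 users per IP
g.V().has(T.label, 'USER').has('USER_ID', '12345').out('USER_IP').count()
// => ~300 IP vertices
g.V().has(T.label, 'USER').has('USER_ID', '12345').out('USER_IP').in().count()
// => ~300 * 30 = ~9000 USER vertices touched before valueMap() even runs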
Does Titan parallelize this kind of query across all partitions of the backend storage? I tried using limit:
g.V().has(T.label, 'USER').has('USER_ID', '12345').out('USER_IP').limit(50).in().limit(100).valueMap()
It is also slow. I hope this kind of query can be done within 5 seconds. How does Titan's limit() work? Does it fetch all results first and then apply the limit?
How can I increase the performance here? Can anyone give some advice?
Upvotes: 1
Views: 205
Reputation: 3565
One quick performance gain you could get is from using Titan's vertex-centric indices, which allow you to make very quick leaps from one vertex to another. For example, you could try something like this:
mgmt = graph.openManagement()
userId = mgmt.getPropertyKey('userId')
userIp = mgmt.getEdgeLabel('USER_IP')
// index USER_IP edges in both directions, sorted by the userId key
// (note: the sort key must be a property that exists on the edges themselves)
mgmt.buildEdgeIndex(userIp, 'userIdByUserIP', Direction.BOTH, Order.decr, userId)
mgmt.commit()
This creates a simple vertex-centric index on the USER_IP edges.
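As a rough sketch of how a traversal could take advantage of it (assuming the userId key is actually stored on the USER_IP edges, and keeping the USER_ID vertex key from your question):
// the outE().order().limit() portion can be answered from the
// 'userIdByUserIP' index instead of scanning every incident edge
g.V().has(T.label, 'USER').has('USER_ID', '12345').
    outE('USER_IP').order().by('userId', decr).limit(50).
    inV().valueMap()
This way the limit is applied on the edges before the second hop, so the fan-out is capped early rather than after all neighbors have been fetched.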
If you want to look up multiple user IPs from multiple user vertices, then you could try using Titan-Hadoop. However, that is a more involved process; a sketch follows below.
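A minimal sketch of what that could look like with TinkerPop's hadoop-gremlin and SparkGraphComputer (the properties file name and path are assumptions, not something from your setup):
// open the graph for OLAP processing; the conf file is hypothetical
graph = GraphFactory.open('conf/hadoop-graph/read-cassandra.properties')
g = graph.traversal(computer(SparkGraphComputer))
// full-graph traversal executed as a distributed job across all partitions
g.V().hasLabel('USER').out('USER_IP').in('USER_IP').dedup().count()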
Upvotes: 1