Reputation: 3572
I am using Titan 0.4 over Cassandra. I have indexed my key ("ip_address" in my case) as NON-UNIQUE for performance and scalability. The drawback is that the graph now allows duplicate vertices. I am running a background task that cleans up duplicate vertices by iterating through all vertices. What is the best way to identify duplicate vertices in a graph? The estimated size of the graph in production is around 10M to 15M vertices, possibly more. Is there any feature of the Titan index that makes it easy to identify duplicates? Thanks in advance.
Index creation Gremlin script:
g.makeKey("ip_address").dataType(String.class).indexed("standard",Vertex.class).make();
Upvotes: 0
Views: 729
Reputation: 10904
I would start with a Titan/Hadoop job:
g.V().ip_address.groupCount()
Then take the IP addresses with a count greater than 1 and use them in OLTP mode to clean up or merge the duplicated vertices.
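To make the two steps concrete, here is a minimal sketch of the logic in plain Java. It does not call Titan at all: the input map of vertex id to "ip_address" stands in for whatever the Hadoop job emits, and the class and method names (DuplicateFinder, findDuplicates, idsToRemove) are hypothetical. Keeping the first vertex seen for each address is also just an assumption; a real merge would have to decide which vertex survives and move its edges over first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: in-memory stand-in for "group by ip_address, keep count > 1".
final class DuplicateFinder {

    // Groups vertex ids by ip_address and keeps only addresses seen more than once.
    static Map<String, List<Long>> findDuplicates(Map<Long, String> ipByVertexId) {
        Map<String, List<Long>> idsByIp = new LinkedHashMap<>();
        for (Map.Entry<Long, String> entry : ipByVertexId.entrySet()) {
            idsByIp.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                   .add(entry.getKey());
        }
        // Drop every address that maps to a single vertex.
        idsByIp.values().removeIf(ids -> ids.size() < 2);
        return idsByIp;
    }

    // Assumption: the first id in the group survives, the rest are removal candidates.
    static List<Long> idsToRemove(List<Long> duplicateIds) {
        return new ArrayList<>(duplicateIds.subList(1, duplicateIds.size()));
    }
}
```

In the real graph, the OLTP pass would look up each duplicated address through the existing "ip_address" index rather than scanning all vertices, so only the affected vertices are touched.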
Upvotes: 0