Remis Haroon - رامز

Reputation: 3572

TITAN : Identify and remove duplicate vertices in graph

I am using TITAN 0.4 over Cassandra. I have indexed my key ("ip_address" in my case), but as NON-UNIQUE, for performance and scalability. The challenge now is that the graph allows duplicate vertices. I am running a background task that cleans up duplicate vertices by iterating through all vertices. What is the best approach to identify duplicate vertices in a graph? The estimated size of the graph in production is around 10M to 15M vertices, possibly more. Is there any feature of the TITAN index that makes it easy to identify a duplicate? Thanks in advance.

Index creation Gremlin script

g.makeKey("ip_address").dataType(String.class).indexed("standard",Vertex.class).make();

Upvotes: 0

Views: 729

Answers (1)

Daniel Kuppitz

Reputation: 10904

I would start with a Titan/Hadoop job:

g.V().ip_address.groupCount()

Then use those IP addresses with a count > 1 to clean up / merge duplicated vertices in OLTP mode.
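
The filtering step on the job's output could look roughly like the sketch below. It is plain Java with no Titan or Faunus calls: the in-memory map of vertex id to ip_address is only a stand-in for whatever the scan produces, and the class name DuplicateIpFinder and both method names are hypothetical. Keeping the first vertex per IP and listing the rest is an assumption; the actual merge of edges and properties is left to the OLTP cleanup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Titan: models "group by ip_address, keep count > 1".
final class DuplicateIpFinder {

    // Groups vertex ids by ip_address and keeps only the groups with more than one vertex.
    static Map<String, List<Long>> findDuplicates(Map<Long, String> ipByVertexId) {
        Map<String, List<Long>> idsByIp = new LinkedHashMap<>();
        for (Map.Entry<Long, String> e : ipByVertexId.entrySet()) {
            idsByIp.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Drop every ip_address that occurs on a single vertex only.
        idsByIp.values().removeIf(ids -> ids.size() < 2);
        return idsByIp;
    }

    // Assumption: the first vertex seen for an ip_address survives; the rest are
    // returned as candidates to merge into it and then remove.
    static List<Long> idsToRemove(Map<String, List<Long>> duplicates) {
        List<Long> result = new ArrayList<>();
        for (List<Long> ids : duplicates.values()) {
            result.addAll(ids.subList(1, ids.size()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the scan output.
        Map<Long, String> sample = new LinkedHashMap<>();
        sample.put(4L, "10.0.0.1");
        sample.put(8L, "10.0.0.2");
        sample.put(12L, "10.0.0.1");
        sample.put(16L, "10.0.0.3");
        sample.put(20L, "10.0.0.1");

        Map<String, List<Long>> dups = findDuplicates(sample);
        System.out.println(dups);              // {10.0.0.1=[4, 12, 20]}
        System.out.println(idsToRemove(dups)); // [12, 20]
    }
}
```

At 10M to 15M vertices the grouping itself belongs in the Hadoop job, as suggested above; only the much smaller set of IP addresses with a count > 1 would need to be handled this way.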

Upvotes: 0
