Reputation: 31
I have a huge dataset that must be inserted into a graph database via Gremlin (Gremlin Server). Because the XML file is too big (over 8 GB), I decided to split it into nine manageable XML files (about 1 GB each). My question is: is there a way to insert each of these data files into my TinkerPop graph database via Gremlin Server, i.e. by trying something like this? Or, what is the best way to insert this data?
graph.io(IoCore.graphml()).readGraph("data01.xml")
graph.io(IoCore.graphml()).readGraph("data02.xml")
graph.io(IoCore.graphml()).readGraph("data03.xml")
graph.io(IoCore.graphml()).readGraph("data04.xml")
graph.io(IoCore.graphml()).readGraph("data05.xml")
Upvotes: 1
Views: 278
Reputation: 46226
That's a large GraphML file. I'm not sure I've ever come across one so large. I'd wonder how you went about splitting it since GraphML files aren't easily split as they are XML based, have a header and a structure where vertices and edges are in separate nodes. It was for these reasons (and others) that TinkerPop developed formats like Gryo and GraphSON which were easily split for processing in Hadoop-like file structures.
That said, assuming you split your GraphML files correctly so that each file is a complete subgraph, I suppose you would be able to load them the way you are suggesting. However, I'd be concerned about how much memory you'd need to do this. The io() loader is not meant for bulk parallel loading and basically holds an in-memory cache of vertices to speed loading. That in-memory cache is essentially just a HashMap that doesn't expire its contents. Therefore, while the loading is occurring you would need to be able to hold all of the Vertex instances for a particular file in memory.
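If you do go that route, a simple sketch would load the files one at a time so that at most one file's vertex cache is held in memory at once (the file-naming pattern and the per-file commit are assumptions; readGraph() on a transactional graph commits internally, so the explicit commit is just a safety net for graphs where that does not apply):

```groovy
// Sketch: load each split file sequentially; the io() vertex cache is
// scoped to a single readGraph() call, so memory peaks at one file's worth.
(1..9).each { i ->
    def file = String.format("data%02d.xml", i)   // assumes data01.xml .. data09.xml
    graph.io(IoCore.graphml()).readGraph(file)
    if (graph.features().graph().supportsTransactions()) {
        graph.tx().commit()
    }
}
```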
I don't know what your requirements are or how you ended up with such a large GraphML file, but for a graph of this size I would look at either the provider-specific bulk loading tools of the graph you are using or some custom method that uses spark-gremlin or a Gremlin script of some sort to parallel load your data.
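As a rough illustration of the "Gremlin script" option, the usual pattern is to convert the data to a simple line-oriented format first (e.g. CSV) and then insert in batches with periodic commits so transactions stay small. Everything here is hypothetical and would need to be adapted to your schema (the file name, the "name,age" column layout, and the batch size are all assumptions):

```groovy
// Sketch: batch-load vertices from a CSV you generated from the XML.
int batchSize = 10000
int count = 0
new File("data01.csv").eachLine { line ->
    def (name, age) = line.split(',')             // assumed two-column layout
    g.addV('person').
      property('name', name).
      property('age', age.toInteger()).iterate()
    if (++count % batchSize == 0) {
        g.tx().commit()                           // keep transactions small
    }
}
g.tx().commit()                                   // commit the final partial batch
```

Scripts like this are easy to run in parallel, one per split file, which is where they gain over the single-threaded io() loader.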
Upvotes: 0