headmyshoulder

Reputation: 6310

Using Neo4j to store a lot of small xml documents

I have a large set of XML documents, say 1,000,000,000 of them, each containing from 10 to 1000 elements. I wonder whether it is possible to use Neo4j to store and access this data.

I see two possible ways to generate the graph. First, I could store each document as a separate graph. For example, if I have two documents with

 # document one
 a -> b
 b -> c
 # document two
 a -> d
 a -> c

I need to "rename" the nodes to distinguish between nodes coming from different documents:

 (a1) --> (b1)
 (b1) --> (c1)
 (a2) --> (d2)
 (a2) --> (c2)
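
The renaming would not have to be baked into the node identity itself; a document id stored as a node property would give the same separation. A minimal Cypher sketch of creating document one this way (the Element label, the name and doc properties, and the CHILD relationship type are just illustrative names I made up):

 // document one as its own disconnected subgraph;
 // the doc property replaces the "1" suffix in the node names
 CREATE (a1:Element {name: 'a', doc: 1}),
        (b1:Element {name: 'b', doc: 1}),
        (c1:Element {name: 'c', doc: 1}),
        (a1)-[:CHILD]->(b1),
        (b1)-[:CHILD]->(c1)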

On the other hand, I could also treat elements with the same name as the same node, so I get a strongly interconnected graph in which each relationship records which document it came from:

 (a) -[:Document { doc : 1 } ]-> (b)
 (b) -[:Document { doc : 1 } ]-> (c)
 (a) -[:Document { doc : 2 } ]-> (d)
 (a) -[:Document { doc : 2 } ]-> (c)
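
Individual documents would still be recoverable in this model by filtering on the relationship property. Something like this (again, names are only for illustration):

 // pull document 1 back out of the interconnected graph
 MATCH (x)-[r:Document {doc: 1}]->(y)
 RETURN x, r, y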

For my use case both approaches are conceivable. But I wonder whether Neo4j is designed for "a large number of small graphs" or for "a small but strongly interconnected graph". Any ideas?

Upvotes: 1

Views: 1032

Answers (2)

Sebastian Good

Reputation: 6360

TL;DR I don't think Neo4j cares in principle which method you choose, but the method should be driven first by your desired semantics, and in practice you may find that if your data is large and connected enough, Neo4j's lack of vertex-centric indexes may push you to a database like Titan. Or you may be over-modeling the problem and should just use MySQL.

Longer explanation: A billion documents with a thousand elements each (and an unspecified connectivity between those elements) starts to sound like a few terabytes of data. At this scale, regardless of database engine, you need to understand anticipated query plans unless you plan your database to be simply custodial.

In your former approach, you will make it very easy indeed to find and traverse any single document. You could even argue a graph database is an unnecessary luxury: just keep a little GraphML/GraphSON/whatever snippet of text for each graph and throw them all into a key-value or document store (e.g. LevelDB) or a simple table (e.g. Cassandra, MySQL). If you only want one document at a time, and it's only a few thousand elements, load that 1-2 MB of data into an in-memory graph library and ask your questions there. This implies there is no semantic link between similar elements in different documents.

But it seems there must be semantic links: if a in document 1 is the same as a in document 2, your second approach allows a far richer set of queries. At this point you need to understand how large the relationship (edge) count will be for each element (vertex).

Without knowing the distribution of data across your documents, it's difficult to guess whether this will create problems, but a very typical one is the so-called 'super node', where one of your nodes (a) ends up with millions or even hundreds of millions of relationships (edges). These can be a modeling problem and can cause you to run out of memory in single-node solutions like Neo4j. Most graph databases now have some tools to help you deal with super nodes, but it's worth thinking a bit about what your data distributions look like. When a query arrives at a super node, a graph database which allows you to distribute your data or do more sophisticated indexing of edge properties may be appropriate. For instance, a might appear in 100,000,000 documents, but perhaps those edges carry properties which you can use to restrict the number of them to consider when traversing.
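
To make that last point concrete, here is a rough Cypher sketch of such a property-restricted traversal, reusing the question's model (the Element label is my own assumption, and note that Neo4j does not index relationship properties, so this filter still has to iterate over the super node's edges):

 // only follow edges that belong to a handful of documents
 MATCH (a:Element {name: 'a'})-[r:Document]->(other)
 WHERE r.doc IN [1, 2, 3]
 RETURN DISTINCT other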

(A modeling side note: as the number of connections a vertex has (e.g. a in the example above) approaches the number of vertices in the graph, the amount of information you're actually storing decreases. If a is connected to everything, then you can stop storing those edges and simply represent that fact some other way.)

Upvotes: 3

FylmTM

Reputation: 2007

TL;DR: technically Neo4j is able to work with almost any data size (limited by its address space), generally without problems. But the data model should be built carefully, so that the data can be traversed efficiently.

You can use whichever approach you want, as long as it allows you to query your data efficiently. It depends on your use case and needs.

There are no drawbacks to storing a lot of separate small graphs that are not connected to each other.

Native property graph

It’s important to look into how Neo4j works with your data in order to understand its limits.

Neo4j implements the “native graph” idea. This means that the data on disk is stored in a form that is already optimized for graph traversal.

Basically this means that traversing from node a to node b takes O(1) time (that is not strictly true, but let’s stick with it for simplicity).

On disk, each record is stored as a fixed-size record in a store file. For example, a node record is 9 bytes long, so getting the node with id 156 is just a matter of reading 9 bytes starting at offset 156 * 9 = 1404.

What we have:

  • We can read records from disk in constant time.
  • We can hop from one node to another in constant time.

Capacity

http://neo4j.com/docs/stable/capabilities-capacity.html

In Neo4j, data size is mainly limited by the address space of the primary keys for Nodes, Relationships, Properties and RelationshipTypes. Currently, the address space is as follows:

  • nodes: 2^35 (∼ 34 billion)
  • relationships: 2^35 (∼ 34 billion)
  • properties: 2^36 to 2^38 depending on property types (maximum ∼ 274 billion, always at least ∼ 68 billion)
  • relationship types: 2^16 (∼ 65 000)

Basically, we can store a lot of data in the database.

Hardware

We are limited by existing hardware. Reading from disk is slow, so Neo4j has a page cache. Basically, this cache is a one-to-one representation of the on-disk store files in RAM.

Ideally, we want our database to fit fully in RAM to get the best read speed.

If we can’t fit the whole database in RAM, then Neo4j will swap pages in and out, depending on which part of the database is currently in use.

Constantly swapping cache pages can be a cause of poor traversal performance.
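
As a rough illustration of that tuning: in Neo4j 2.2 the page cache size is configured in conf/neo4j.properties (the 16g below is purely an example value; size it to fit your store files):

 # conf/neo4j.properties
 # size the page cache so the hot part of the store fits in RAM
 dbms.pagecache.memory=16g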

Indexes

Indexes are available in Neo4j. They are needed to make it possible to look up nodes by their properties. Querying via an index is not constant time, but it is pretty fast anyway. Once the “start” node is found, we can traverse from that node in almost constant time.
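
A minimal Cypher sketch of that pattern, reusing the question's model with a hypothetical Element label and name property (syntax as of Neo4j 2.x):

 // schema index, so start nodes can be found by property
 CREATE INDEX ON :Element(name);

 // the index lookup finds the start node, the rest is pure traversal
 MATCH (start:Element {name: 'a'})-[:Document]->(next)
 RETURN next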

Relationships

So, when you are traversing from one node to another, Neo4j has to find out which relationship should be followed.

To do that, Neo4j iterates over all relationships of the start node until it finds the appropriate one, then goes further (gets the end node and repeats the process).

That gives us:

  • If the relationship count is small (e.g. < 10), then traversing is almost O(1).
  • If the relationship count is large (e.g. > 1000), then traversing is slower, because Neo4j has to iterate over the relationships to find the appropriate one.

This should be taken into consideration when building a graph data model.
The general recommendation: try not to have dense nodes in your model (one way to do that for this question's data is sketched below).
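
A rough sketch of one such remodeling for the question's data: keep per-document element nodes (the asker's first model), group them under one node per document, and recover cross-document identity through the indexed name property instead of a shared super node. The Doc and Element labels and the relationship types are my own illustrative assumptions, not an official pattern:

 // per-document element nodes stay sparse; no shared super node
 CREATE (d:Doc {id: 2}),
        (d)-[:CONTAINS]->(a2:Element {name: 'a'}),
        (d)-[:CONTAINS]->(c2:Element {name: 'c'}),
        (a2)-[:CHILD]->(c2)

 // "the same element across documents" becomes an index lookup
 MATCH (e:Element {name: 'a'})<-[:CONTAINS]-(doc)
 RETURN doc.id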

Other notes

Neo4j is capable of working with HUGE data sizes. You just need to take into consideration how Neo4j works with your data and executes traversals, to make your queries efficient enough.

To back this up, have a look at Introducing Neo4j on IBM POWER8.

Upvotes: 4
