headmyshoulder

Reputation: 6310

Using Neo4j to store a lot of small xml documents

I have a large set of XML documents, say 1,000,000,000 of them, each containing from 10 to 1000 elements. I wonder whether it is possible to use Neo4j to store and access this data.

I see two possible ways to generate the graph. First, I could store each document as a separate graph. For example, if I have two documents with

 # document one
 a -> b
 b -> c
 # document two
 a -> d
 a -> c

I need to "rename" the nodes to distinguish between nodes coming from different documents:

 (a1) --> (b1)
 (b1) --> (c1)
 (a2) --> (d2)
 (a2) --> (c2)
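
The renaming would not have to be baked into the node identity itself; a document id stored as a node property would give the same separation. A minimal Cypher sketch of creating document one this way (the Element label, the name and doc properties, and the CHILD relationship type are just illustrative names I made up):

 // document one as its own disconnected subgraph;
 // the doc property replaces the "1" suffix in the node names
 CREATE (a1:Element {name: 'a', doc: 1}),
        (b1:Element {name: 'b', doc: 1}),
        (c1:Element {name: 'c', doc: 1}),
        (a1)-[:CHILD]->(b1),
        (b1)-[:CHILD]->(c1)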

On the other hand, I could also treat elements with the same name as the same node, so I get a strongly interconnected graph in which each relationship records which document it came from:

 (a) -[:Document { doc : 1 } ]-> (b)
 (b) -[:Document { doc : 1 } ]-> (c)
 (a) -[:Document { doc : 2 } ]-> (d)
 (a) -[:Document { doc : 2 } ]-> (c)
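
Individual documents would still be recoverable in this model by filtering on the relationship property. Something like this (again, names are only for illustration):

 // pull document 1 back out of the interconnected graph
 MATCH (x)-[r:Document {doc: 1}]->(y)
 RETURN x, r, y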

For my use case both approaches are conceivable. But I wonder whether Neo4j is designed for "a large number of small graphs" or for "a small but strongly interconnected graph". Any ideas?

Upvotes: 1

Views: 1032

Answers (2)

Sebastian Good

Reputation: 6360

TL;DR I don't think Neo4j cares in principle which method you choose, but the method should be driven first by your desired semantics, and in practice you may find that if your data is large and connected enough, Neo4j's lack of vertex-centric indexes may push you to a database like Titan. Or you may be over-modeling the problem and should just use MySQL.

Longer explanation: A billion documents with a thousand elements each (and an unspecified connectivity between those elements) starts to sound like a few terabytes of data. At this scale, regardless of database engine, you need to understand anticipated query plans unless you plan your database to be simply custodial.

In your former approach, you will make it very easy indeed to find and traverse any single document. You could even argue a graph database is an unnecessary luxury: just keep a little GraphML/GraphSON/whatever snippet of text for each graph and throw them all into a key-value or document store (e.g. LevelDB) or a simple table (e.g. Cassandra, MySQL). If you only want one document at a time, and it's only a few thousand elements, load that 1-2 MB of data into an in-memory graph library and ask your questions there. This implies there is no semantic link between similar elements in different documents.

But it seems there must be semantic links: if a in document 1 is the same as a in document 2, your second approach allows a far richer set of queries. At this point you need to understand how large the relationship (edge) count will be for each element (vertex).

Without knowing the distribution of data across your documents, it's difficult to guess whether this will create problems, but a very typical one is the so-called 'super node', where one of your nodes (a) ends up with millions or even hundreds of millions of relationships (edges). These can be a modeling problem and can cause you to run out of memory in single-node solutions like Neo4j. Most graph databases now have some tools to help you deal with super nodes, but it's worth thinking a bit about what your data distributions look like. When a query arrives at a super node, a graph database which allows you to distribute your data or do more sophisticated indexing of edge properties may be appropriate. For instance, a might appear in 100,000,000 documents, but perhaps those edges carry properties which you can use to restrict the number of them to consider when traversing.
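
To make that last point concrete, here is a rough Cypher sketch of such a property-restricted traversal, reusing the question's model (the Element label is my own assumption, and note that Neo4j does not index relationship properties, so this filter still has to iterate over the super node's edges):

 // only follow edges that belong to a handful of documents
 MATCH (a:Element {name: 'a'})-[r:Document]->(other)
 WHERE r.doc IN [1, 2, 3]
 RETURN DISTINCT other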

(A modeling side note: as the number of connections a vertex has (e.g. a in the example above) approaches the number of vertices in the graph, the amount of information you're actually storing decreases. If a is connected to everything, then you can stop storing those edges and simply represent that fact some other way.)

Upvotes: 3

FylmTM

Reputation: 2007

TL;DR: technically Neo4j is able to work with almost any data size (limited by its address space), generally without problems. But the data model should be built carefully, so that the data can be traversed efficiently.

You can use whichever approach you want, as long as it allows you to query your data efficiently. It depends on your use case and needs.

There are no drawbacks to storing a lot of separate small graphs that are not connected to each other.

Native property graph

It’s important to look into how Neo4j works with your data in order to understand its limits.

Neo4j implements the “native graph” idea. This means that the data on disk is stored in a form that is already optimized for graph traversal.

Basically this means that traversing from node a to node b takes O(1) time (that is not strictly true, but let’s stick with it for simplicity).

On disk, each record is stored as a fixed-size record in a store file. For example, a node record is 9 bytes long, so getting the node with id 156 is just a matter of reading 9 bytes starting at offset 156 * 9 = 1404.

What we have:

  • We can read records from disk in constant time.
  • We can hop from one node to another in constant time.

Capacity

http://neo4j.com/docs/stable/capabilities-capacity.html

In Neo4j, data size is mainly limited by the address space of the primary keys for Nodes, Relationships, Properties and RelationshipTypes. Currently, the address space is as follows:

  • nodes: 2^35 (∼ 34 billion)
  • relationships: 2^35 (∼ 34 billion)
  • properties: 2^36 to 2^38 depending on property types (maximum ∼ 274 billion, always at least ∼ 68 billion)
  • relationship types: 2^16 (∼ 65 000)

Basically, we can store a lot of data in the database.

Hardware

We are limited by existing hardware. Reading from disk is slow, so Neo4j has a page cache. Basically, this cache is a one-to-one representation of the on-disk store files in RAM.

Ideally, we want our database to fit fully in RAM to get the best read speed.

If we can’t fit the whole database in RAM, then Neo4j will swap pages in and out, depending on which part of the database is currently in use.

Constantly swapping cache pages can be a cause of poor traversal performance.
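
As a rough illustration of that tuning: in Neo4j 2.2 the page cache size is configured in conf/neo4j.properties (the 16g below is purely an example value; size it to fit your store files):

 # conf/neo4j.properties
 # size the page cache so the hot part of the store fits in RAM
 dbms.pagecache.memory=16g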

Indexes

Indexes are available in Neo4j. They are needed to make it possible to look up nodes by their properties. Querying via an index is not constant time, but it is pretty fast anyway. Once the “start” node is found, we can traverse from that node in almost constant time.
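
A minimal Cypher sketch of that pattern, reusing the question's model with a hypothetical Element label and name property (syntax as of Neo4j 2.x):

 // schema index, so start nodes can be found by property
 CREATE INDEX ON :Element(name);

 // the index lookup finds the start node, the rest is pure traversal
 MATCH (start:Element {name: 'a'})-[:Document]->(next)
 RETURN next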

Relationships

So, when you are traversing from one node to another, Neo4j has to find out which relationship should be followed.

To do that, Neo4j iterates over all relationships of the start node until it finds the appropriate one, then goes further (gets the end node and repeats the process).

That gives us:

  • If the relationship count is small (e.g. < 10), then traversing is almost O(1).
  • If the relationship count is large (e.g. > 1000), then traversing is slower, because Neo4j has to iterate over the relationships to find the appropriate one.

This should be taken into consideration when building a graph data model.
The general recommendation: try not to have dense nodes in your model (one way to do that for this question's data is sketched below).
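
A rough sketch of one such remodeling for the question's data: keep per-document element nodes (the asker's first model), group them under one node per document, and recover cross-document identity through the indexed name property instead of a shared super node. The Doc and Element labels and the relationship types are my own illustrative assumptions, not an official pattern:

 // per-document element nodes stay sparse; no shared super node
 CREATE (d:Doc {id: 2}),
        (d)-[:CONTAINS]->(a2:Element {name: 'a'}),
        (d)-[:CONTAINS]->(c2:Element {name: 'c'}),
        (a2)-[:CHILD]->(c2)

 // "the same element across documents" becomes an index lookup
 MATCH (e:Element {name: 'a'})<-[:CONTAINS]-(doc)
 RETURN doc.id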

Other notes

Neo4j is capable of working with HUGE data sizes. You just need to take into consideration how Neo4j works with your data and executes traversals, to make your queries efficient enough.

To back this up, have a look at Introducing Neo4j on IBM POWER8.

Upvotes: 4
