Reputation: 628
I have been going through documentation but I am unable to identify what the general guidelines are for bulk loading.
As far as I can see the best way to bulk load data into graphdb is by using the LoadRDF tool.
However the general rules for the appropriate settings are not familiar to me. First of all if you have an "average" server with an SSD drive what kind of parsing speed is acceptable? 1.000 statements / sec, 10.000 statements / sec or is it much more or less?
Also what are good settings? For example you can set the -Dpool.buffer.size which has a default of 200.000 statements but if you have 10gig of ram what would be the rule of thumb to increase this and if you have 100 or 300 gig of ram?
Another option is the -Dinfer.pool.size which is set to the maximum of threads as there are cpus with a minimum of 4. Thus 1 core = 4 threads and 32 cores is 32 threads. I think this does not require any extra tuning or is this only there if you want to reduce the CPU load and not overshoot to lets say 64 threads if you have 32 cores?
There are also extra options available through the turtle file with examples in configs/templates where perhaps owlim:cache-memory and owlim:tuple-index-memory could be useful during loading and the other settings more useful for after loading?
In the end does it also matter if you have 100's of individual files instead of one big turtle file and / or does compressing the files increase loading speed or does it only reduce the initial disk usage?
For me personally, I currently have a setup of 290gb ram and 32 cores and 1.8T raid 0 SSD drives (which will have a backup after loading) and trying to do an initial load of 3 billion triples, from SSD to same SSD, which with the global speed of 16.461 statements per second will take a while but I am not sure if and how to improve this.
Upvotes: 1
Views: 183
Reputation: 1193
The best place to get a reference to the standard data loading speed is the GraphDB benchmark page.
From a computational point of view, the data loading process consists in generating unique internal IDs for all RDF resources and indexing all statements in multiple sorted collections like PSOC, POSC and CPSO (if context indexes are enabled). This process is mainly affected by:
Reasoning complexity - the database integrates a forward chaining inference engine. This means that for every newly added statement a predefined set of rules is fired recursively. Depending on the particular dataset and the configured rules, the number of materialised implicit statements may increase dramatically. The data loading speed is affected by the number of indexed statements, but not input explicit triples.
Size of the dataset - with the increase of the numbered indexed statements in each collection, the time to add more data also increases. The main two factors are the logarithmic complexity of the sorted collection, and the number of page splits because of the random coming IDs in at least one of the collections.
The number of CPU cores will speed up the data loading only if there is inference. The import of every new file will have a minimal overhead, so this should not be a concern unless their size is considerable. For the heap size, we have found best that a combination between SSD and a heap size limited to 30GB works best. If you restrict the heap size to 30GB, then you can benefit from the XX:+UseCompressedOops
and still have a reasonable GC time.
Please note that GraphDB 8.x will also reserve off heap space for immutable data structures like the mapping of RDF resources to internal IDs! For a 3B dataset, it may become as big as 15GB. The main reason behind this design decision is to save GC time.
Upvotes: 2