Best settings for bulk load in GraphDB

Question

I have been going through the documentation, but I am unable to identify the general guidelines for bulk loading.

As far as I can see, the best way to bulk load data into GraphDB is the LoadRDF tool.

However, I am not familiar with the general rules for choosing appropriate settings. First of all, on an "average" server with an SSD, what kind of parsing speed is acceptable: 1,000 statements/sec, 10,000 statements/sec, or something much higher or lower?

Also, what are good settings? For example, you can set -Dpool.buffer.size, which defaults to 200,000 statements. What would be the rule of thumb for increasing it if you have 10 GB of RAM, and what if you have 100 or 300 GB?

Another option is -Dinfer.pool.size, which defaults to the number of CPUs, with a minimum of 4. Thus 1 core gives 4 threads and 32 cores give 32 threads. I think this does not require any extra tuning. Or is the option only there to reduce the CPU load, rather than to overshoot to, let's say, 64 threads on 32 cores?
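To make my reading of that default explicit, here is a minimal Python sketch. The function name is hypothetical, and the rule is only my understanding of the documentation, not something taken from GraphDB itself.

```python
def assumed_infer_pool_size(cpu_count: int) -> int:
    """Assumed default for -Dinfer.pool.size: one thread per CPU, but at least 4.

    This mirrors my reading of the docs, not GraphDB's actual implementation.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return max(4, cpu_count)


for cores in (1, 8, 32):
    print(cores, "cores ->", assumed_infer_pool_size(cores), "threads")
```

If that reading is right, the default already scales with the machine, and setting the option by hand would only be needed to cap or raise the thread count.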

There are also extra options available through the Turtle configuration file, with examples in configs/templates. Perhaps owlim:cache-memory and owlim:tuple-index-memory could be useful during loading, while the other settings are more useful after loading?

Finally, does it matter if you have hundreds of individual files instead of one big Turtle file? And does compressing the files increase loading speed, or does it only reduce the initial disk usage?

Personally, I currently have a setup with 290 GB of RAM, 32 cores, and 1.8 TB of RAID 0 SSDs (which will be backed up after loading). I am trying to do an initial load of 3 billion triples, from SSD to the same SSD. At the reported global speed of 16,461 statements per second this will take a while, but I am not sure whether or how I can improve it.
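For a sense of scale, here is a rough back-of-the-envelope estimate in Python. It assumes the global speed is 16,461 statements per second (sixteen thousand) and that it stays constant for the whole load, which may well not hold in practice. The function name is just for illustration.

```python
def estimated_load_hours(total_statements: int, statements_per_second: float) -> float:
    """Estimate load time in hours, assuming a constant loading rate."""
    if statements_per_second <= 0:
        raise ValueError("statements_per_second must be positive")
    return total_statements / statements_per_second / 3600


hours = estimated_load_hours(3_000_000_000, 16_461)
print(f"{hours:.1f} hours ({hours / 24:.1f} days)")  # 50.6 hours (2.1 days)
```

So at the current rate the initial load would take a bit over two days, and doubling the rate would cut it to roughly one.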
