Reputation: 15769
When trying to load a current Wikidata dump as documented in Get Your Own Copy of WikiData, following the procedure described in https://muncca.com/2019/02/14/wikidata-import-in-apache-jena/, I am running into performance problems and limits of Apache Jena's tdbloader commands.
There seem to be two versions of it: tdbloader2 for TDB1 and tdb2.tdbloader for TDB2.
The name tdbloader2 for the TDB1 loader is confusing and led to its use as a first attempt.
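For reference, this is roughly how the two loaders are invoked; the database locations and the dump file name below are placeholders, not the exact paths used here:

    # TDB1 bulk loader (confusingly named tdbloader2)
    tdbloader2 --loc /data/wikidata-tdb1 latest-all.ttl.gz

    # TDB2 bulk loader
    tdb2.tdbloader --loc /data/wikidata-tdb2 latest-all.ttl.gz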
The experience with TDB1/tdbloader2 was that the loading went quite well for the first few billion triples.
The speed was 150k triples/second initially. It then fell to some 100k triples/second at around 9 billion triples, dropped to 15,000 triples/second at around 10 billion triples, and stayed around 5,000 triples/second when moving towards 11 billion triples.
I had expected the import to have finished by then, so I am now even doubting that the progress counter reports triples at all; it might be counting lines of Turtle input instead, which is not the same thing, since the input has some 15 billion lines but only some 11 billion triples are expected.
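One way to check this would be to count lines and parsed triples separately. A rough sketch, assuming the usual dump file name and that the installed Jena version supports riot --count:

    # number of lines of Turtle input
    zcat latest-all.ttl.gz | wc -l

    # number of actual triples: riot parses the file as a stream
    # and only counts, it does not build a database
    riot --count latest-all.ttl.gz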
Since the import had already been running for 3.5 days at this point, I had to decide whether to abort it and look for better import options or simply wait for a while.
So I placed this question on Stack Overflow. Based on AndyS's hint that there are two versions of tdbloader, I aborted the TDB1 import after some 4.5 days, with over 11 billion triples reported as imported in the "data" phase. The performance was down to 2.3k triples/second at that point.
With the modified script using tdb2.tdbloader, the import has been retried several times, as documented in the wiki. Two tdb2.tdbloader import attempts already failed with crashing Java VMs, so I switched the hardware from my MacPro to the old Linux box (which is unfortunately slower) and later back again.
I changed the Java virtual machine to a recent OpenJDK after the older Oracle JVM crashed in a first attempt with tdb2.tdbloader. This Java VM crashed with the same symptoms: # Internal Error (safepoint.cpp:310), see e.g. https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8169477
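To pin down which JVM the loader actually uses, the Jena command line scripts can, as far as I understand them, be pointed at a specific JDK and given extra VM options via environment variables. The paths and heap size here are placeholders:

    # use the OpenJDK installation instead of whatever java is on the PATH
    export JAVA_HOME=/usr/lib/jvm/openjdk-11
    export PATH="$JAVA_HOME/bin:$PATH"

    # extra options picked up by the Jena bin/ scripts
    export JVM_ARGS="-Xmx8G"

    tdb2.tdbloader --loc /data/wikidata-tdb2 latest-all.ttl.gz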
For the attempts with tdb2.tdbloader I'll assume that 15.7 billion triples need to be imported (one per line of the Turtle file). For a truthy dataset the number would be some 13 billion triples.
If you look at the performance results shown in the wiki article, you'll find that there is a logarithmic performance degradation. For rotating disks the degradation is so bad that the import takes too long to be worth waiting for (we are talking multiple months here ...).
In the diagram below both axes have a logarithmic scale. The x-axis shows the log of the total number of triples imported (up to 3 billion when the import was aborted). The y-axis shows the log of the batch / avg sizes - the number of triples imported in a given time frame. The more triples are imported, the slower things get: from a peak of 300,000 triples per second down to as low as 300 triples per second. With the 4th attempt the performance was some 1k triples/second after 11 days, with some 20% of the data imported. This would put the estimated finish of the import at around 230 days - and given the continuing degradation of the speed, probably quite a bit longer (more than a year).
The target database size was 320 GByte at that point, so the result should hopefully fit into the 4 terabytes of disk space allocated for the target, and disk space is not the limiting factor.
Since Jonas Sourlier reported success after some 7 days with an SSD disk, I finally asked my project lead to finance a 4 TB SSD disk and lend it to me for experiments. With that disk a fifth attempt was successful: for the truthy dataset some 5.2 billion triples were imported after about 4.5 days. The bad news is that this is exactly what I didn't want - I had hoped to solve the problem with software and configuration settings, not by throwing quicker and more costly hardware at it. Nevertheless, here is the diagram for this import:
I intend to import the full 12 billion triples soon, and for that it would still be good to know how to improve the speed with software / configuration settings or other non-hardware approaches.
I have not yet tuned the Java VM args or split the files, as mentioned in the Apache Jena users mailing list discussion from the end of 2017.
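A sketch of what splitting could look like; this is untested, and the chunk size, file names and heap setting are assumptions. Since the Turtle dump relies on prefix declarations at the top of the file, it cannot be split naively, so a conversion to the line-based N-Triples format would be needed first:

    # convert Turtle to N-Triples; N-Triples has one triple per line,
    # so the output can be split at arbitrary line boundaries
    # (the uncompressed chunks need a lot of temporary disk space;
    # --additional-suffix is GNU split only)
    riot --output=ntriples latest-all.ttl.gz |
        split -l 500000000 --additional-suffix=.nt - chunk-

    # load all chunks into the same TDB2 database in one run
    JVM_ARGS="-Xmx8G" tdb2.tdbloader --loc /data/wikidata-tdb2 chunk-*.nt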
The current import speed is obviously unacceptable. On the other hand, heavily investing in extra hardware is not an option due to a limited budget.
There are some questions that are not answered by the links you'll find in the wiki article mentioned above:
What is proven to speed up the import without investing into extra hardware?
e.g. splitting the files, changing VM arguments, running multiple processes ...
What explains the decreasing speed at higher numbers of triples and how can this be avoided?
What successful multi-billion triple imports for Jena do you know of and what are the circumstances for these?
Upvotes: 5
Views: 1047