matentzn

Reputation: 357

Why is adding an RDF dump (InputStream) to an RDF4J repository so slow (in Java)?

I am loading an RDF dump from the web as an InputStream; it contains between 120 and 1500 triples. On average, clearing the context takes about half a second, while adding the triples takes around 74 seconds for the 120-triple dump. The RDF/XML serialisation is between 6 KB and 195 KB in size.

InputStream input = ...
try (RepositoryConnection conn = db.getConnection()) {
    try {
        conn.clear(context);
        conn.add(input, "", RDFFormat.RDFXML, context);
    } catch (Exception e) {
        e.printStackTrace();
    } 
}

The repository is initialised as follows:

RemoteRepositoryManager manager = new RemoteRepositoryManager(serverUrl);
manager.initialize();
db = manager.getRepository("repo");

Upvotes: 1

Views: 249

Answers (1)

jschnasse

Reputation: 9588

You could try the following:

  1. Check the download side, e.g. test the same code with a local file instead of the stream from the web.
  2. Check the upload side, e.g. use an in-memory repository instead of the remote one: Repository repo = new SailRepository(new MemoryStore());
  3. Give your Java application enough memory using -Xmx in JAVA_OPTS.
  4. I am not sure what conn.clear(context); is intended to do. As I understand it, it removes all triples in the context.
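One way to carry out the first two checks is to read the whole stream into memory first and time that step on its own, so the download time is not mixed up with the time the repository needs for the insert. The following is only a minimal sketch using the Java standard library; the class name StreamTiming, the helper readFully and the stand-in input are made up for illustration and are not part of RDF4J.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamTiming {

    /** Hypothetical helper: copies the whole stream into a byte array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the InputStream that comes from the web in the question.
        InputStream input = new ByteArrayInputStream(
                "<rdf:RDF/>".getBytes(StandardCharsets.UTF_8));

        // Step 1: time the download alone.
        long start = System.nanoTime();
        byte[] data = readFully(input);
        long downloadMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Read " + data.length + " bytes in " + downloadMs + " ms");

        // Step 2 (not run here): time the insert alone, using the buffered copy, e.g.
        // conn.add(new ByteArrayInputStream(data), "", RDFFormat.RDFXML, context);
    }
}
```

If the first step accounts for most of the 74 seconds, the source server or the network is the bottleneck; if the second step does, the remote repository is the place to look.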

For me, it takes around 5 minutes to load 10,000,000 triples from a 2.7 GB RDF dump from Wikidata into an in-memory repository (I run it as a Maven test with export MAVEN_OPTS=-Xmx7000m). That is about 33,333 triples per second, if I calculated correctly ;-).

@Test
public void variant3() throws MalformedURLException, IOException {
    Repository repo = new SailRepository(new MemoryStore());
    repo.initialize();
    IRI context = repo.getValueFactory().createIRI("info/mycontext:context1");
    RDFFormat format = RDFFormat.NTRIPLES;
    System.out.println("Load gzipped file of format " + format);
    try (InputStream in = new URL(
                    "https://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/wikidata-terms.nt.gz")
                                    .openConnection().getInputStream();
                    NotifyingRepositoryConnectionWrapper con = new NotifyingRepositoryConnectionWrapper(repo,
                                    repo.getConnection());) {
        RepositoryConnectionListenerAdapter myListener = new RepositoryConnectionListenerAdapter() {
            private long count = 0;
            @Override
            public void add(RepositoryConnection arg0, Resource arg1, IRI arg2, Value arg3, Resource... arg4) {
                count++;
                if (count % 100000 == 0)
                    System.out.println("Add statement number " + count + "\n" + arg1 + " " + arg2 + " " + arg3);
            }
        };
        con.addRepositoryConnectionListener(myListener);
        con.add(in, "", format,context);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

Upvotes: 3
