fge

Reputation: 121780

Using jOOQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?

I have a CSV which is... 34 million lines long. Yes, no joking.

This is a CSV file produced by a parser tracer; the file is then imported into the corresponding debugging program.

And the problem is in the latter.

Right now I import all rows one by one:

private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
            .peek(ignored -> status.incrementProcessedNodes())
            .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}

csvToNode is simply a mapper which turns a String (one line of the CSV) into a NodesRecord ready for insertion.

Now, the line:

            .peek(ignored -> status.incrementProcessedNodes())

well... the method name says it all: it increments a counter in status which reflects the number of rows processed so far.

What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
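For reference, here is a minimal sketch of what such a status object could look like (hypothetical: the actual class is not shown here). An AtomicLong makes the counter safe to read from the polling thread while the loading thread updates it:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the status object described above
public final class LoadStatus
{
    private final AtomicLong processedNodes = new AtomicLong();

    // called by the loading thread, once per row
    public void incrementProcessedNodes()
    {
        processedNodes.incrementAndGet();
    }

    // called by the polling thread, about once per second
    public long getProcessedNodes()
    {
        return processedNodes.get();
    }
}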

But jOOQ now has this (taken from their documentation), which can load directly from a CSV:

create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();

(though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account).
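One way around that, assuming loadCSV() also accepts a Reader (that overload is an assumption on my part), would be to wrap the stream with an explicit charset:

// assumes a Reader-based loadCSV() overload; uses java.io.InputStreamReader
// and java.nio.charset.StandardCharsets
create.loadInto(AUTHOR)
      .loadCSV(new InputStreamReader(inputstream, StandardCharsets.UTF_8))
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();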

And of course jOOQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.

The problem, however, is that I lose the "per second" information I get from the current code... And if I replace the query with a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this MAY take a long time.

So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and query, quickly enough and at any time, how many rows have been inserted so far?

(note: should that matter, I currently use H2; a PostgreSQL version is also planned)

Upvotes: 2

Views: 1250

Answers (1)

Lukas Eder

Reputation: 221106

There are a number of ways to optimise this.

Custom load partitioning

One way to optimise query execution at your side is to collect sets of values into:

  • Bulk statements (as in INSERT INTO t VALUES(1), (2), (3), (4))
  • Batch statements (as in JDBC batch)
  • Commit segments (commit after N statements)

... instead of executing them one by one. This is what the Loader API also does (see below). All of these measures can significantly increase load speed.

This is the only way you can currently "listen" to loading progress.
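For example, here is a minimal sketch of such manual partitioning, reusing the NODES, csvToNode, status and nodesPath names from your question. CHUNK_SIZE and status.addProcessedNodes() are hypothetical; batchInsert() sends each chunk as a single JDBC batch:

// requires java.util.List and java.util.ArrayList in addition to your imports
private static final int CHUNK_SIZE = 1000; // hypothetical tuning parameter

private void insertNodesInChunks(final DSLContext jooq)
    throws IOException
{
    final List<NodesRecord> chunk = new ArrayList<>(CHUNK_SIZE);

    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode).forEach(record -> {
            chunk.add(record);
            if (chunk.size() == CHUNK_SIZE)
                flushChunk(jooq, chunk);
        });
    }

    // don't forget the last, partial chunk
    flushChunk(jooq, chunk);
}

private void flushChunk(final DSLContext jooq, final List<NodesRecord> chunk)
{
    if (chunk.isEmpty())
        return;

    // one JDBC batch for the whole chunk instead of one statement per row
    jooq.batchInsert(chunk).execute();

    // progress stays observable: update the counter once per chunk
    // (addProcessedNodes(int) is a hypothetical bulk variant of your
    // incrementProcessedNodes())
    status.addProcessedNodes(chunk.size());
    chunk.clear();
}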

Load partitioning using jOOQ 3.6+

(this hasn't been released yet, but it will be soon)

jOOQ 3.6 natively implements the above three partitioning measures.

Using vendor-specific CSV loading mechanisms

jOOQ will always need to pass through JDBC and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. for the ones you've mentioned:

  • CSV support in H2 (e.g. the CSVREAD function)
  • The COPY command in PostgreSQL

These will be more low-level, but certainly faster than anything else.
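For illustration only (table name and path are hypothetical), H2 can read a CSV file by itself via its CSVREAD function, and you can still issue that statement through jOOQ's plain SQL API:

// H2 only; PostgreSQL would use its COPY command instead.
// Table name and path are hypothetical.
jooq.execute(
    "INSERT INTO nodes SELECT * FROM CSVREAD('/path/to/nodes.csv', NULL, 'charset=UTF-8')");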

General remarks

What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).

That's a very interesting idea. I'll register this as a feature request for the Loader API.

though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account

We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jOOQ/jOOQ/issues/4141
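Assuming the fix takes the shape of a charset-aware overload (an assumption on my part; see the issue for the final API), usage would look something like:

// hypothetical sketch of a charset-aware loadCSV() overload in jOOQ 3.6
create.loadInto(AUTHOR)
      .loadCSV(inputstream, StandardCharsets.UTF_8)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();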

And of course jOOQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.

No, jOOQ doesn't make any assumptions about maximising throughput. This is extremely difficult and depends on many factors other than your DB vendor, e.g.:

  • Constraints on the table
  • Indexes on the table
  • Logging turned on/off
  • etc.

jOOQ offers you help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:

  • Set the commit rate (e.g. commit every 1000 rows) to avoid long UNDO / REDO logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods.

In jOOQ 3.6+, you can also:

  • Set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution. This can be done via the bulkXXX() methods.
  • Set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details). This can be done via the batchXXX() methods.
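Put together, a sketch of what this could look like with the jOOQ 3.6 Loader API, reusing the AUTHOR example from your question (the rates are arbitrary):

create.loadInto(AUTHOR)
      .bulkAfter(10)      // combine 10 rows in a single INSERT .. VALUES (..), (..), ... statement
      .batchAfter(10)     // combine 10 statements in a single JDBC batch
      .commitAfter(1000)  // commit every 1000 rows
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();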

Upvotes: 1
