fge

Reputation: 121780

Using jOOQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?

I have a CSV which is... 34 million lines long. Yes, no joking.

This is a CSV file produced by a parser tracer; the file is then imported into the corresponding debugging program.

And the problem is in the latter.

Right now I import all rows one by one:

private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
            .peek(ignored -> status.incrementProcessedNodes())
            .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}

csvToNode is simply a mapper which turns a String (one line of the CSV) into a NodesRecord ready for insertion.

Now, the line:

            .peek(ignored -> status.incrementProcessedNodes())

well... the method name says it all: it increments a counter in status which reflects the number of rows processed so far.

What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
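For reference, here is a minimal sketch of what such a status object could look like (hypothetical: the actual class is not shown here). An AtomicLong makes the counter safe to read from the polling thread while the loading thread updates it:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the status object described above
public final class LoadStatus
{
    private final AtomicLong processedNodes = new AtomicLong();

    // called by the loading thread, once per row
    public void incrementProcessedNodes()
    {
        processedNodes.incrementAndGet();
    }

    // called by the polling thread, about once per second
    public long getProcessedNodes()
    {
        return processedNodes.get();
    }
}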

But jOOQ now has this (taken from their documentation), which can load directly from a CSV:

create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();

(though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account).
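One way around that, assuming loadCSV() also accepts a Reader (that overload is an assumption on my part), would be to wrap the stream with an explicit charset:

// assumes a Reader-based loadCSV() overload; uses java.io.InputStreamReader
// and java.nio.charset.StandardCharsets
create.loadInto(AUTHOR)
      .loadCSV(new InputStreamReader(inputstream, StandardCharsets.UTF_8))
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();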

And of course jOOQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.

The problem, however, is that I lose the "per second" information I get from the current code... And if I replace the query with a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this MAY take a long time.

So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and query, quickly enough and at any time, how many rows have been inserted so far?

(note: should that matter, I currently use H2; a PostgreSQL version is also planned)

Upvotes: 2

Views: 1250

Answers (1)

Lukas Eder

Reputation: 221106

There are a number of ways to optimise this.

Custom load partitioning

One way to optimise query execution at your side is to collect sets of values into:

  • Bulk statements (as in INSERT INTO t VALUES(1), (2), (3), (4))
  • Batch statements (as in JDBC batch)
  • Commit segments (commit after N statements)

... instead of executing them one by one. This is what the Loader API also does (see below). All of these measures can significantly increase load speed.

This is the only way you can currently "listen" to loading progress.
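For example, here is a minimal sketch of such manual partitioning, reusing the NODES, csvToNode, status and nodesPath names from your question. CHUNK_SIZE and status.addProcessedNodes() are hypothetical; batchInsert() sends each chunk as a single JDBC batch:

// requires java.util.List and java.util.ArrayList in addition to your imports
private static final int CHUNK_SIZE = 1000; // hypothetical tuning parameter

private void insertNodesInChunks(final DSLContext jooq)
    throws IOException
{
    final List<NodesRecord> chunk = new ArrayList<>(CHUNK_SIZE);

    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode).forEach(record -> {
            chunk.add(record);
            if (chunk.size() == CHUNK_SIZE)
                flushChunk(jooq, chunk);
        });
    }

    // don't forget the last, partial chunk
    flushChunk(jooq, chunk);
}

private void flushChunk(final DSLContext jooq, final List<NodesRecord> chunk)
{
    if (chunk.isEmpty())
        return;

    // one JDBC batch for the whole chunk instead of one statement per row
    jooq.batchInsert(chunk).execute();

    // progress stays observable: update the counter once per chunk
    // (addProcessedNodes(int) is a hypothetical bulk variant of your
    // incrementProcessedNodes())
    status.addProcessedNodes(chunk.size());
    chunk.clear();
}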

Load partitioning using jOOQ 3.6+

(this hasn't been released yet, but it will be soon)

jOOQ 3.6 natively implements the above three partitioning measures.

Using vendor-specific CSV loading mechanisms

jOOQ will always need to pass through JDBC and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. for the ones you've mentioned:

  • CSV support in H2 (e.g. the CSVREAD function)
  • The COPY command in PostgreSQL

These will be more low-level, but certainly faster than anything else.
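For illustration only (table name and path are hypothetical), H2 can read a CSV file by itself via its CSVREAD function, and you can still issue that statement through jOOQ's plain SQL API:

// H2 only; PostgreSQL would use its COPY command instead.
// Table name and path are hypothetical.
jooq.execute(
    "INSERT INTO nodes SELECT * FROM CSVREAD('/path/to/nodes.csv', NULL, 'charset=UTF-8')");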

General remarks

What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).

That's a very interesting idea. I'll register this as a feature request for the Loader API.

though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account

We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jOOQ/jOOQ/issues/4141
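Assuming the fix takes the shape of a charset-aware overload (an assumption on my part; see the issue for the final API), usage would look something like:

// hypothetical sketch of a charset-aware loadCSV() overload in jOOQ 3.6
create.loadInto(AUTHOR)
      .loadCSV(inputstream, StandardCharsets.UTF_8)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();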

And of course jOOQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.

No, jOOQ doesn't make any assumptions about maximising throughput. This is extremely difficult and depends on many factors other than your DB vendor, e.g.:

  • Constraints on the table
  • Indexes on the table
  • Logging turned on/off
  • etc.

jOOQ offers you help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:

  • Set the commit rate (e.g. commit every 1000 rows) to avoid long UNDO / REDO logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods.

In jOOQ 3.6+, you can also:

  • Set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution. This can be done via the bulkXXX() methods.
  • Set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details). This can be done via the batchXXX() methods.
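Put together, a sketch of what this could look like with the jOOQ 3.6 Loader API, reusing the AUTHOR example from your question (the rates are arbitrary):

create.loadInto(AUTHOR)
      .bulkAfter(10)      // combine 10 rows in a single INSERT .. VALUES (..), (..), ... statement
      .batchAfter(10)     // combine 10 statements in a single JDBC batch
      .commitAfter(1000)  // commit every 1000 rows
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();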

Upvotes: 1
