Reputation: 121780
I have a CSV which is... 34 million lines long. Yes, no joking.
This is a CSV file produced by a parser tracer which is then imported into the corresponding debugging program.
And the problem is in the latter.
Right now I import all rows one by one:
private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
            .peek(ignored -> status.incrementProcessedNodes())
            .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}
csvToNode is simply a mapper which will turn a String (a line of the CSV) into a NodesRecord for insertion.
Now, the line:

.peek(ignored -> status.incrementProcessedNodes())

well... The method name tells pretty much everything: it increments a counter in status which reflects the number of rows processed so far.
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
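For context, the status holder boils down to a thread-safe counter; here is a minimal sketch (everything beyond incrementProcessedNodes() is an assumption, not my actual class):

import java.util.concurrent.atomic.AtomicLong;

// Sketch of the status holder: an AtomicLong makes the counter safe to
// read from the monitoring thread while the loading thread increments it.
// Only incrementProcessedNodes() appears above; the rest is illustrative.
public final class LoadStatus
{
    private final AtomicLong processedNodes = new AtomicLong();

    public void incrementProcessedNodes()
    {
        processedNodes.incrementAndGet();
    }

    // Polled every second by the monitoring code
    public long getProcessedNodes()
    {
        return processedNodes.get();
    }
}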
But now jOOQ has this (taken from their documentation), which can load directly from a CSV:
create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();
(though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account).
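One workaround, assuming a loadCSV(Reader) overload is available in the jOOQ version at hand, would be to wrap the stream in a Reader with an explicit charset:

import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Wrap the raw InputStream in a Reader with an explicit charset, so the
// encoding is no longer left to the platform default. Assumes a
// loadCSV(Reader) overload exists in the jOOQ version being used.
try (Reader reader = new InputStreamReader(inputstream, StandardCharsets.UTF_8)) {
    create.loadInto(AUTHOR)
          .loadCSV(reader)
          .fields(ID, AUTHOR_ID, TITLE)
          .execute();
}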
And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized.
The problem however is that I lose the "by second" information I get from the current code... And if I replace the query with a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this MAY take a long time.
So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and query, quickly enough and at any time, how many rows have been inserted so far?
(note: should that matter, I currently use H2; a PostgreSQL version is also planned)
Upvotes: 2
Views: 1250
Reputation: 221106
There are a number of ways to optimise this.
One way to optimise query execution at your side is to collect sets of values into:

- bulk statements (as in INSERT INTO t VALUES(1), (2), (3), (4))
- batch statements (as in JDBC batching)
- commit segments (committing every N statements)

... instead of executing them one by one. This is what the Loader API also does (see below). All of these measures can heavily increase load speed.
This is the only way you can currently "listen" to loading progress.
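For illustration, here's a rough sketch of such manual partitioning applied to your insertNodes() method, reusing the names from your question. The chunk size, the flushChunk() helper, and the use of DSLContext.batchInsert() (plain jOOQ batching, not the Loader API) are all assumptions:

import java.util.ArrayList;
import java.util.List;

// Manual load partitioning: collect records into chunks and execute each
// chunk as a single JDBC batch instead of one statement per row. The
// progress counter still moves, just in steps of CHUNK_SIZE.
private static final int CHUNK_SIZE = 1000;

private void insertNodesInChunks(final DSLContext jooq)
    throws IOException
{
    // nodesPath, UTF8, csvToNode and status are the fields from your class
    final List<NodesRecord> chunk = new ArrayList<>(CHUNK_SIZE);

    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode).forEach(rec -> {
            chunk.add(rec);
            if (chunk.size() == CHUNK_SIZE)
                flushChunk(jooq, chunk);
        });
    }

    flushChunk(jooq, chunk); // last, possibly partial chunk
}

private void flushChunk(final DSLContext jooq, final List<NodesRecord> chunk)
{
    if (chunk.isEmpty())
        return;
    // One JDBC batch per chunk instead of one round trip per row
    jooq.batchInsert(chunk).execute();
    // The monitoring thread still sees progress, chunk by chunk
    for (int i = 0; i < chunk.size(); i++)
        status.incrementProcessedNodes();
    chunk.clear();
}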
jOOQ natively implements the above three partitioning measures in jOOQ 3.6 (this hasn't been released yet, but it will be, soon).
jOOQ will always need to pass through JDBC, and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. for the two engines you've mentioned:

- H2: CSVREAD
- PostgreSQL: COPY

This will be more low-level, but certainly faster than anything else.
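For instance, with H2 you could issue the CSV import as plain SQL through jOOQ (a sketch; the table name, file path, and charset option are assumptions):

// H2's CSVREAD reads the file server-side in a single statement,
// avoiding per-row JDBC round trips. File path and charset option
// are illustrative.
jooq.execute(
    "INSERT INTO nodes SELECT * FROM CSVREAD('/path/to/nodes.csv', NULL, 'charset=UTF-8')"
);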
"What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load)."
That's a very interesting idea. Will register this as a feature request for the Loader API: Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
"though personally I'd never use THAT .loadCSV() overload since it doesn't take the CSV encoding into account"
We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jOOQ/jOOQ/issues/4141
"And of course JooQ will manage to turn that into a suitable construct so that for this or that DB engine the throughput is maximized."

No, jOOQ doesn't make any assumptions about maximising throughput. This is extremely difficult and depends on many other factors than your DB vendor, e.g.:

- constraints on the table
- indexes on the table
- whether logging is activated
- etc.
jOOQ offers you help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:
- set the commit rate (e.g. commit every 1000 rows) through the various commitXXX() methods.

In jOOQ 3.6+, you can also:

- set the bulk statement rate through the various bulkXXX() methods.
- set the batch statement rate through the various batchXXX() methods.
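Put together, a tuned Loader call might look like this sketch (the three rates are arbitrary assumptions to be measured against your own data; reader is an encoding-aware Reader as discussed above):

// jOOQ 3.6+ Loader tuning (a sketch): 500-row bulk inserts, 10 bulk
// statements per JDBC batch, a commit every 5 batches. All three rates
// are arbitrary assumptions; measure with your own data.
create.loadInto(NODES)
      .bulkAfter(500)
      .batchAfter(10)
      .commitAfter(5)
      .loadCSV(reader)
      .fields(NODES.fields())
      .execute();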
Upvotes: 1