Reputation: 11

Java multi-threaded handling of a rowset

I'm pulling down a table full of data, and I need to handle this and for every row do a bit of formatting and then push out to a REST API.

I use a PostgreSQL database and Java implementation, the idea is to pull all data down, get the amount of rows and spin up threads to handle a chunk at a time.

I've got the connection set up and pulling the table into a cached row set, and using last(), getRow() and beforeFirst() to get row count.

I'm trying to find a way to split out a chunk of a rowset and hand it off to be handled, but I can't seem to see anything to do this.

There's limit x and things, but I want to avoid numerous database calls with data this size.

Any ideas would be greatly appreciated.

Here's the kind of thing I'm looking at:

RowSet rst = RowSetProvider.newFactory().createCachedRowSet();
rst.setUrl(url);
rst.setUsername(username);
rst.setPassword(password);

String cmd = "select * from event_log";

rst.setCommand(cmd);
rst.execute();

ResultSetMetaData rsmd = rst.getMetaData();
int columnsNumber = rsmd.getColumnCount();

rst.last();
int size = rst.getRow();
int maxPerThread = 1000;
rst.beforeFirst();

int threadsToCreate = size / maxPerThread;

for (int loopCount = 0; loopCount < threadsToCreate; loopCount++) {
    //Create chunk
    //Create thread
    //Pass chunk into thread and start it
    //Once chunk is finished then thread and chunk are destroyed
}

Upvotes: 1

Answers (1)

rzwitserloot

Reputation: 103491

This is the proper way to think about JDBC interactions:

All queries are like an ad-hoc view: SELECT foo, bar BETWEEN a AND b AS baz FROM foo INNER JOIN whatever; - this effectively creates a new temporary table.
A ResultSet is a live interactive concept: A ResultSet is not a dump of the returned data. It is like the relationship between a FileInputStream and a file on disk: The ResultSet has methods that give you data, and that data is probably obtained by chatting to the database, 'live', to obtain this information. The ResultSet itself only has a few handles, and not actual data, though it may do some caching, you have no idea.

As a consequence:

ResultSet is utterly non-parallellizable. If you share a ResultSet object with more than one thread, you wrote a bug, and you can't recover from there.
In many DBs, 'ask for the length' is tantamount to running the entire query soup to nuts, and is therefore quite slow. You probably don't want to do that, and there is no real reason to do that from the perspective of 'I want to concurrently process the information I receieved'. You've picked the wrong approach.
ResultSets can (and generally, for performance reasons, should be!) configured as 'forward only', meaning: You can advance by one row by calling .next(), and once you did that, you can't go back. This significantly reduces the load on the DB server, as it doesn't have to be prepared to properly respond to the request to hop back to the start.

Here's what I suggest you do:

You have a single 'controller' thread which has the ResultSet and runs the query.
Once the query returns, you have no idea how many records you do have. But you do know how much you want to parallelize - how many threads you want to be concurrently churning away at processing this data.
Thus, the answer: Spin up that many threads in the form of an ExecutorPool. Then, have your controller pull rows (call resultSet.next() and pull all data into java types by invoking all the various .getFoo(idxOrColName) methods), marshalling it all into a single java object. I suggest you write a POJO that represents one row's worth of data and create one for each row.
Then, your controller thread will take this object and considers this object 'a job'.

You've now reduced the problem to a basic forkjoin style strategy: You have one thread that produces jobs, and you have some code that will take a single job and completes it. I've just described what ExecutorPool and friends are designed to do.

It is crucial that the ResultSet object is not accessible by your processor threads. There is no point to pull rows from the DB in parallel, because the DB isn't parallel and wouldn't be able to give you this information any faster than a single thread. The only parallelising win you can score here is to do the processing of the data in a concurrent fashion, which is why the above model cannot be improved upon without much more drastic changes.

If you're looking for drastic redesigns, you need to 'pre-chunk'. Let's say, for example, that you already know you have a database with a million rows, and each row has the property that it has a completely random ID. You also know you have X processor threads, where X is a dynamic number that depends on many factors, such as how many CPU cores the hardware you run on has.

Then:

You fire up X threads. You tell each thread its index (so, if you have 7 threads, one has 'index 0', another has 'index 1', all the way up to 'index 6'), and how many total threads there are.

Then, each thread runs the following query:

SELECT * FROM jobs WHERE unid % 7 = 5;

That's the query the 6th job thread would run.

This guarantees that each thread is running about an equal number of jobs, give or take.

Generally this is less efficient than the previous model, given that this most likely means the DB is just doing more work (running the same query 7-fold, instead of only once), and any given worker thread may start idling whilst others are still running, vs. the controller-that-pulls-and-hands-jobs-out model where you won't run into the situation that one thread is done whilst others still have lots of jobs left.

NB: RowSet and ResultSet work effectively the exact same way. In fact, the DB version of RowSet (JdbcRowSet) is implemented as a light wrapper around ResultSet.

Upvotes: 4

Java multi-threaded handling of a rowset

Answers (1)

Related Questions