Reputation: 11
I'm pulling down a table full of data, and I need to handle this and for every row do a bit of formatting and then push out to a REST API
.
I use a PostgreSQL
database and Java
implementation, the idea is to pull all data down, get the amount of rows and spin up threads to handle a chunk at a time.
I've got the connection set up and pulling the table into a cached row set, and using last()
, getRow()
and beforeFirst()
to get row count.
I'm trying to find a way to split out a chunk of a rowset and hand it off to be handled, but I can't seem to see anything to do this.
There's limit x and things, but I want to avoid numerous database calls with data this size.
Any ideas would be greatly appreciated.
Here's the kind of thing I'm looking at:
RowSet rst = RowSetProvider.newFactory().createCachedRowSet();
rst.setUrl(url);
rst.setUsername(username);
rst.setPassword(password);
String cmd = "select * from event_log";
rst.setCommand(cmd);
rst.execute();
ResultSetMetaData rsmd = rst.getMetaData();
int columnsNumber = rsmd.getColumnCount();
rst.last();
int size = rst.getRow();
int maxPerThread = 1000;
rst.beforeFirst();
int threadsToCreate = size / maxPerThread;
for (int loopCount = 0; loopCount < threadsToCreate; loopCount++) {
//Create chunk
//Create thread
//Pass chunk into thread and start it
//Once chunk is finished then thread and chunk are destroyed
}
Upvotes: 1
Views: 1047
Reputation: 103491
This is the proper way to think about JDBC interactions:
SELECT foo, bar BETWEEN a AND b AS baz FROM foo INNER JOIN whatever;
- this effectively creates a new temporary table.As a consequence:
.next()
, and once you did that, you can't go back. This significantly reduces the load on the DB server, as it doesn't have to be prepared to properly respond to the request to hop back to the start.Here's what I suggest you do:
resultSet.next()
and pull all data into java types by invoking all the various .getFoo(idxOrColName)
methods), marshalling it all into a single java object. I suggest you write a POJO that represents one row's worth of data and create one for each row.You've now reduced the problem to a basic forkjoin style strategy: You have one thread that produces jobs, and you have some code that will take a single job and completes it. I've just described what ExecutorPool and friends are designed to do.
It is crucial that the ResultSet object is not accessible by your processor threads. There is no point to pull rows from the DB in parallel, because the DB isn't parallel and wouldn't be able to give you this information any faster than a single thread. The only parallelising win you can score here is to do the processing of the data in a concurrent fashion, which is why the above model cannot be improved upon without much more drastic changes.
If you're looking for drastic redesigns, you need to 'pre-chunk'. Let's say, for example, that you already know you have a database with a million rows, and each row has the property that it has a completely random ID. You also know you have X processor threads, where X is a dynamic number that depends on many factors, such as how many CPU cores the hardware you run on has.
Then:
You fire up X threads. You tell each thread its index (so, if you have 7 threads, one has 'index 0', another has 'index 1', all the way up to 'index 6'), and how many total threads there are.
Then, each thread runs the following query:
SELECT * FROM jobs WHERE unid % 7 = 5;
That's the query the 6th job thread would run.
This guarantees that each thread is running about an equal number of jobs, give or take.
Generally this is less efficient than the previous model, given that this most likely means the DB is just doing more work (running the same query 7-fold, instead of only once), and any given worker thread may start idling whilst others are still running, vs. the controller-that-pulls-and-hands-jobs-out model where you won't run into the situation that one thread is done whilst others still have lots of jobs left.
NB: RowSet and ResultSet work effectively the exact same way. In fact, the DB version of RowSet (JdbcRowSet) is implemented as a light wrapper around ResultSet.
Upvotes: 4