ilovetolearn
ilovetolearn

Reputation: 2060

How to choose the number of threads for a database batch job

I have created a batch job to extract records from Oracle database. It is reading the records from JDBC and writing to files.

However, the processing is too slow and I would like to increase the no. of threads to perform parallel processing.

I have create a Queue to deposit the record Ids to be processed. I will fire multiple threads to process the records simultaneously.

Based on the total number of records / no of threads, and the time for each record to be processed, I would be able to calculate the time required for process all records.

However, how do I determine the no of threads I can allocate? I believe it is based on the number of CPU cores. I have 8 CPU cores on my server.

Upvotes: 1

Views: 1819

Answers (3)

Raedwald
Raedwald

Reputation: 48702

For database code, such as your's, execution time will be dominated by the database I/O, rather than CPU activity. So in practice only a small number of threads will fully utilize the I/O system. If your database is large (And I guess it is, otherwise you wouldn't be concerned about its performance), and stored on a HDD, the I/O will be vastly more expensive than the CPU processing. That I/O will consist of a sequence of seeks interspersed with sequential I/O, and the HDD can do only one operation at a time. The seeks are enormously expensive, and large seeks are more expensive than small seeks. If there are multiple threads doing I/O, the operating system will interleave their I/O in attempt to be fair to each thread and to provide high performance. The operating system will probably use the elevator algorithm. Using multiple threads for I/O results in multiple locations to seek to and thus the option to have small seeks between those locations. Unfortunately, the optimum number of threads will depend on the database schema, optimizer and the query you are doing. Without the benefit of measurements, I would guess about one thread per table in your query, but at least 2 threads.

Upvotes: 1

GhostCat
GhostCat

Reputation: 140613

Thing is: this is actually tougher than it seems to be. Real world performance tuning is much more than than coming up with a fixed number of threads to use!

First of all, alone figuring programmatically how many cores are there isn't easy ( see my other answer for a list of helpful links ).

Then: it also depends on CPU hardware. Some cores can run only one or two threads in parallel, others maybe 4 or even more.

And of course, you have to understand where your bottle neck is. If you have 32 threads trying to write to the same io device, that could potentially slow down things.

Long story short: in your situation, when you control the whole stack, you simply do experiments. Start with the formula given by the other answer. Then measure the relevant numbers and learn what effects exist.

You followup question implies that you are still a bit "naive" here. When you don't understand that formula, what value could it be to you?

My suggestion: simply go, get that book that the other answers refers to, and then turn to chapter 8.2, name Sizing threadpools and start reading:

The ideal size for a thread pool depends on the types of tasks that will be submitted and the characteristics of the deployment system. Thread pool sizes should rarely be hard coded; instead pool sizes should be provided by a configuration mechanism or computed dynamically by consulting Runtime.availableProcessors.

Sizing thread pools is not an exact science, but fortunately you need only avoid the extremes of "too big" and "too small". If a thread pool is too big, then threads compete for scarce CPU and memory resources, resulting in higher memory usage and possible resource exhaustion. If it is too small, throughput suffers as processors go unused despite available work.

To size a thread pool properly, you need to understand your computing environment, your resource budget, and the nature of your tasks. How many processors does the deployment system have? How much memory? Do tasks perform mostly computation, I/O, or some combination? Do they require a scarce resource, such as a JDBC connection?

If you have different categories of tasks with very different behaviors, consider using multiple thread pools so each can be tuned according to its workload.

Really crucial then:

For compute intensive tasks, an Ncpuprocessor system usually achieves optimum utilization with a thread pool of Ncpu +1 threads.

So, as already explained:

  • This is complicated stuff. You have to know what you are doing.
  • Therefore you have to study this. This isn't something that can be explained in a single answer.
  • On the other hand, you don't need a degree in computer science to get to a working solution. That is where the aforementioned experiments come in.

Upvotes: 2

Hearen
Hearen

Reputation: 7838

There is an equation provided by Brian Goetz in <Concurrency in Practice>:

Nthreads = Ncpu * Ucpu * (1 + W/C)

  • Nthreads is the ideal number of threads;
  • Ncpu = number of CPUs;
  • Ucpu = target CPU utilization, 0 <= Ucpu <= 1;
  • W/C = ratio of wait time to compute time.

Upvotes: 5

Related Questions