Reputation: 48307
I need to run a set of fairly simple functions (average, weighted average/blur, move n spaces down, and similar) on a number of layers, each being a 2D array of primitives (int or float). The inputs will be at most 8k by 4k, but more often smaller than 2k by 1k. The input data is immutable and, after every cell has been processed, will be replaced by the output and the whole thing run again. I don't expect it to run in real time, but I do need to process about 10 images per second. All input functions may sample multiple cells, but only write to a single one, and no state is maintained (a fragment shader, for all intents and purposes).
Doing this on the client side, I would offload this to the GPU and have no problems with a few thousand iterations per second. However, this needs to run on a server (probably a VM) with no GPU. The servers will typically be RH-based Linux, but I'd rather avoid adding a C library just for this.
My current test, with the code in this gist, is maxing out around 8 images per second, single-threaded, on a 4k by 2k image. That's the bare minimum for acceptable performance, and multi-threading might make it acceptable (still testing that). JMH gives:
Benchmark                        Mode  Samples  Score  Score error  Units
t.ShaderFunc.testProcess        thrpt       40  5.506        0.478  ops/s
t.ShaderFunc.testProcessInline  thrpt       40  7.657        0.561  ops/s
t.ShaderFunc.testProcessProc    thrpt       40  5.685        0.350  ops/s
With the basic function:
public int blur(final int[][] data, final int x, final int y) {
    float accumulator = 0;
    int[] col = data[x];
    accumulator += col[y];
    if (x > 0) {
        accumulator += data[x-1][y] * 0.5f;
    }
    if (x < data.length - 1) {
        accumulator += data[x+1][y] * 0.5f;
    }
    if (y > 0) {
        accumulator += col[y-1] * 0.5f;
    }
    if (y < col.length - 1) {
        accumulator += col[y+1] * 0.5f;
    }
    return Math.round(accumulator / 3f);
}
As one of the simpler functions I'll be dealing with, that has me worried.
I would prefer that the processing function be an interface (and in Java 8, potentially a lambda), but some initial benchmarks show roughly a 30% performance increase when the processing function is inlined in the loop. Given how tight the loops can be, I can imagine call overhead being an issue.
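For reference, the interface-based variant I have in mind looks something like the following sketch (the names here are illustrative, not the gist's exact API):

```java
// Hypothetical shader interface; names are mine, not from the gist.
@FunctionalInterface
interface Shader {
    int apply(int[][] data, int x, int y);
}

class ShaderDriver {
    // Applies the shader to every cell, writing results into a fresh array.
    static int[][] process(int[][] data, int width, int height, Shader shader) {
        int[][] next = new int[width][height];
        for (int i = 0; i < width; ++i) {
            for (int j = 0; j < height; ++j) {
                next[i][j] = shader.apply(data, i, j);
            }
        }
        return next;
    }
}
```

It would then be called with either a lambda, e.g. `ShaderDriver.process(blurData, width, height, (d, x, y) -> blur(d, x, y))`, or a method reference like `this::blur`.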
Are there any libraries, intrinsics, or other techniques that would change the performance significantly? Is there any technique that could be used to vectorize this code (I'm unsure if that's possible, given how the functions work)?
Update: After integrating Stuart Marks' changes and running with Aleksey Shipilev's suggestions, the results were quite different. Note that this is on my work machine (a Lenovo W530 with an i7-3840QM, in a CentOS 6.5 VM on VMware Workstation 10) vs my home machine (a Surface Pro 2, on Win 8.1). I'm using JDK 1.8 update 11.
Original gist results:
Benchmark                            Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcess        thrpt       40   9.098        0.054  ops/s
c.s.q.ShaderFunc.testProcessInline  thrpt       40  11.337        1.603  ops/s
c.s.q.ShaderFunc.testProcessProc    thrpt       40  11.706        0.105  ops/s
With the suggested changes (code here):
Benchmark                            Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcess        thrpt       40  40.890        5.772  ops/s
c.s.q.ShaderFunc.testProcessInline  thrpt       40  44.032        2.389  ops/s
c.s.q.ShaderFunc.testProcessProc    thrpt       40  44.378        2.153  ops/s
This is fantastic performance, well above what I was expecting.
Update 2: After changing the code to more closely resemble my desired API, the overhead of calling through a lambda or class seems to have returned. The code was reworked quite a bit to pull in the suggested changes and clean things up, and I refactored the benchmarks based on some JMH recommendations. Based on the code in this gist, the results are:
Benchmark                             Mode  Samples   Score  Score error  Units
c.s.q.ShaderBench.testProcessInline  thrpt      200  40.860        0.184  ops/s
c.s.q.ShaderBench.testProcessLambda  thrpt      200  22.603        0.159  ops/s
c.s.q.ShaderBench.testProcessProc    thrpt      200  22.792        0.117  ops/s
On a VM on the server type I plan to use:
Benchmark                             Mode  Samples   Score  Score error  Units
c.s.q.ShaderBench.testProcessInline  thrpt      200  40.685        0.224  ops/s
c.s.q.ShaderBench.testProcessLambda  thrpt      200  16.077        0.113  ops/s
c.s.q.ShaderBench.testProcessProc    thrpt      200  23.827        0.088  ops/s
On the cheapest Digital Ocean node:
Benchmark                             Mode  Samples   Score  Score error  Units
c.s.q.ShaderBench.testProcessInline  thrpt      200  24.425        0.506  ops/s
c.s.q.ShaderBench.testProcessLambda  thrpt      200   9.643        0.140  ops/s
c.s.q.ShaderBench.testProcessProc    thrpt      200  13.733        0.134  ops/s
That's acceptable performance across the board (and on my work machine and the server VM, pretty fantastic performance). I'm still interested in figuring out why the call has such significant overhead and what can be done to optimize it. I'm currently experimenting with different sets of parameters, and have posted a more specific question.
Upvotes: 0
Views: 1861
Reputation: 132460
Thanks for providing the gist; it made it quite easy to fiddle around with the code to see the performance effects. Also, +1 for using JMH.
First, here are the baseline results on my machine (2009 MacBook Pro, 2.8GHz Core2Duo, JDK 8u5):
Benchmark                            Mode  Samples  Score  Score error  Units
c.s.q.ShaderFunc.testProcess        thrpt        5  7.191        1.140  ops/s
c.s.q.ShaderFunc.testProcessInline  thrpt        5  7.592        0.465  ops/s
c.s.q.ShaderFunc.testProcessProc    thrpt        5  7.326        1.242  ops/s
(c.s.q is com.stackoverflow.questions)
The differences between the techniques in my runs are smaller, though the errors are somewhat higher, and the inline version is still fastest. Since the results were so close, I started off optimizing testProcess, which is the one that makes a direct call to the blur method, since that's the code you've included here. For other readers' convenience, the code that calls the blur method is this:
int width = 4000;
int height = 2000;
int[][] nextData = new int[width][height];
for (int i = 0; i < width; ++i) {
    for (int j = 0; j < height; ++j) {
        nextData[i][j] = blur(blurData, i, j);
    }
}
My first observation is that there are a lot of conditionals in the blur method that avoid stepping off the edges of the matrix. Conveniently, the accumulations done at the edges have the same result if the value "off the edge" is zero (I think this is true of most image-processing kernels). This means that if we pad the edges of the matrix with zeroes and run the loops from 1 to limit-1 instead of 0 to limit, we can drop the conditionals. The loop changes to this:
int width = 4002;
int height = 2002;
int[][] nextData = new int[width][height];
for (int i = 1; i < width-1; ++i) {
    for (int j = 1; j < height-1; ++j) {
        nextData[i][j] = blur(blurData, i, j);
    }
}
(You also have to make corresponding changes to the randomMatrix function that generates the input data.) If you remove the conditionals from the blur method, it now looks like this:
public int blur(final int[][] data, final int x, final int y) {
    float accumulator = 0;
    int[] col = data[x];
    accumulator += col[y];
    accumulator += data[x-1][y] * 0.5f;
    accumulator += data[x+1][y] * 0.5f;
    accumulator += col[y-1] * 0.5f;
    accumulator += col[y+1] * 0.5f;
    return Math.round(accumulator / 3f);
}
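The corresponding padded-input generation might look roughly like this (my sketch; the actual randomMatrix signature in the gist may differ):

```java
import java.util.Random;

class PaddedMatrix {
    // Builds a (width x height) matrix whose one-cell border stays zero,
    // so the unconditional blur can safely read one cell past the interior.
    // The value range 0..255 is an assumption (e.g. 8-bit pixel data).
    static int[][] randomMatrix(int width, int height, long seed) {
        Random rnd = new Random(seed);
        int[][] data = new int[width][height];
        for (int i = 1; i < width - 1; ++i) {
            for (int j = 1; j < height - 1; ++j) {
                data[i][j] = rnd.nextInt(256);
            }
        }
        return data;  // border cells remain 0 by default initialization
    }
}
```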
The results for this version are maybe 15% faster:
Benchmark                     Mode  Samples  Score  Score error  Units
c.s.q.ShaderFunc.testProcess  thrpt       5  8.424        1.035  ops/s
Now let's take a closer look at the calculations. The input is all int data, but we're accumulating into a float variable, and the output is an int as well. Instead of repeated multiplications by 0.5f, we can accumulate double the amounts and then divide by 6f at the end. (There is the possibility of overflow here, though, if the input data is in the 2 billion range.) With some additional simplifications, the revised code looks like this:
public int blur(final int[][] data, final int x, final int y) {
    int[] col = data[x];
    int accumulator = 2 * col[y]
            + data[x-1][y]
            + data[x+1][y]
            + col[y-1]
            + col[y+1];
    return Math.round(accumulator / 6f);
}
And the results are more than 80% faster!
Benchmark                     Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcess  thrpt       5  15.397        1.897  ops/s
With the simplified blur method, let's reconsider inlining. I won't reproduce the code, since it's just taking the body of the blur method and doing the obvious refactoring of it into the nested for loops above (adjusting variable names, etc.). Doing this gives the following results:
Benchmark                           Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcessInline  thrpt       5  15.619        1.607  ops/s
Just a little bit faster, but within the margin of error, so it's hard to tell for sure. It might not be worth inlining if keeping the functions separate makes it easier to plug in different algorithms.
The big win here is getting rid of floating-point operations, particularly floating-point multiplies. Many multi-core systems have more integer than floating-point hardware available, so avoiding FP on such systems will still help.
Ah, that gives me another idea. Can we get rid of the Math.round call and the FP divide? Again, depending on your input's numeric ranges, we can do integer-based rounding. Instead of
Math.round(accumulator / 6f)
we can do something more-or-less equivalent like:
(1000 * accumulator + 500) / 6000
The results with this change are another 25% improvement!
Benchmark                           Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcessInline  thrpt       5  19.517        2.894  ops/s
Today's lesson: to speed things up, replace floating point with integer computation. Of course, you have to pay attention to overflow and integer truncation issues. But if you can make it work, it's worth it.
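As an aside (my addition, not part of the original answer): for a non-negative accumulator, an exactly equivalent integer rounding of accumulator / 6 is (accumulator + 3) / 6, which also drops the extra multiply. The (1000 * accumulator + 500) / 6000 form is only "more-or-less" equivalent, as the answer says; for example, accumulator = 5 yields 0, while Math.round(5 / 6f) is 1.

```java
class IntRounding {
    // Half-up rounding of a / 6 for non-negative a, without floating point.
    // Matches Math.round(a / 6f) for any non-negative a in a sane pixel range.
    static int roundDiv6(int a) {
        return (a + 3) / 6;
    }
}
```

(For negative accumulators, Java's truncating integer division would break this, so it relies on the zero-padded, non-negative input data.)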
UPDATE
In response to a suggestion from Aleksey Shipilev (author of JMH), I've run an alternative version that multiplies by the reciprocal of 6.0f instead of dividing. The code is:
static final float ONE_SIXTH = 1.0f / 6.0f;
...
Math.round(accumulator * ONE_SIXTH);
This replaces a floating-point division with a floating-point multiplication. The results are:
Benchmark                           Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcessInline  thrpt       5  17.144        0.869  ops/s
Noticeably faster than the FP division, but not quite as fast as the integer computation.
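Putting the pieces together, the end state of the optimizations above (zero-padded borders, integer accumulation, a single rounded divide per cell) can be sketched as one inlined pass. This is my consolidation of the steps, not the answer's verbatim code:

```java
class BlurPass {
    // One full blur pass over a zero-padded matrix: interior cells only,
    // integer accumulation, one rounded division per cell.
    static int[][] blurPass(int[][] data) {
        int width = data.length;
        int height = data[0].length;
        int[][] next = new int[width][height];
        for (int x = 1; x < width - 1; ++x) {
            int[] col = data[x];
            int[] left = data[x - 1];
            int[] right = data[x + 1];
            for (int y = 1; y < height - 1; ++y) {
                int acc = 2 * col[y] + left[y] + right[y] + col[y - 1] + col[y + 1];
                next[x][y] = Math.round(acc / 6f);
            }
        }
        return next;  // border stays zero, ready for the next pass
    }
}
```

Hoisting the three column references out of the inner loop is my own tweak; the JIT may well do this on its own, so benchmark before assuming it helps.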
Upvotes: 2