Reputation: 48307
I need to run a set of fairly simple functions (average, weighted average/blur, move n spaces down, and similar) on a number of layers, each being a 2D array of primitives (int or float). The inputs will be at most 8k by 4k, but more often smaller than 2k by 1k. The input data is immutable and, after every cell has been processed, will be replaced by the output and the whole thing run again. I don't expect it to run in real time, but I do need to process about 10 images per second. All input functions may sample multiple cells, but only write to a single one, and no state is maintained (a fragment shader, for all intents and purposes).
Doing this on the client side, I would offload this to the GPU and have no problems with a few thousand iterations per second. However, this needs to run on a server (probably a VM) with no GPU. The servers will typically be RH-based Linux, but I'd rather avoid adding a C library just for this.
My current test, with the code in this gist, is maxing out around 8 images per second, single-threaded, on a 4k by 2k image. That's the bare minimum for acceptable performance, and multi-threading might make it acceptable (still testing that). JMH gives:
Benchmark                        Mode  Samples  Score  Score error  Units
t.ShaderFunc.testProcess        thrpt       40  5.506        0.478  ops/s
t.ShaderFunc.testProcessInline  thrpt       40  7.657        0.561  ops/s
t.ShaderFunc.testProcessProc    thrpt       40  5.685        0.350  ops/s
With the basic function:
public int blur(final int[][] data, final int x, final int y) {
    float accumulator = 0;
    int[] col = data[x];
    accumulator += col[y];
    if (x > 0) {
        accumulator += data[x-1][y] * 0.5f;
    }
    if (x < data.length - 1) {
        accumulator += data[x+1][y] * 0.5f;
    }
    if (y > 0) {
        accumulator += col[y-1] * 0.5f;
    }
    if (y < col.length - 1) {
        accumulator += col[y+1] * 0.5f;
    }
    return Math.round(accumulator / 3f);
}
As one of the simpler functions I'll be dealing with, that has me worried.
I would prefer that the processing function be an interface (and in Java 8, potentially a lambda), but some initial benchmarks show roughly a 30% performance increase when the processing function is inlined in the loop. Given how tight the loops can be, I can imagine call overhead being an issue.
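For reference, the interface-based variant I have in mind looks something like the following sketch (the names here are illustrative, not the gist's exact API):

```java
// Hypothetical shader interface; names are mine, not from the gist.
@FunctionalInterface
interface Shader {
    int apply(int[][] data, int x, int y);
}

class ShaderDriver {
    // Applies the shader to every cell, writing results into a fresh array.
    static int[][] process(int[][] data, int width, int height, Shader shader) {
        int[][] next = new int[width][height];
        for (int i = 0; i < width; ++i) {
            for (int j = 0; j < height; ++j) {
                next[i][j] = shader.apply(data, i, j);
            }
        }
        return next;
    }
}
```

It would then be called with either a lambda, e.g. `ShaderDriver.process(blurData, width, height, (d, x, y) -> blur(d, x, y))`, or a method reference like `this::blur`.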
Are there any libraries, intrinsics, or other techniques that would change the performance significantly? Is there any technique that could be used to vectorize this code (I'm unsure if that's possible, given how the functions work)?
Update: After integrating Stuart Marks' changes and running with Aleksey Shipilev's suggestions, the results were quite different. Note that this is on my work machine (a Lenovo W530 with an i7-3840QM, in a CentOS 6.5 VM on VMware Workstation 10) vs my home machine (a Surface Pro 2, on Win 8.1). I'm using JDK 1.8 update 11.
Original gist results:
Benchmark                            Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcess        thrpt       40   9.098        0.054  ops/s
c.s.q.ShaderFunc.testProcessInline  thrpt       40  11.337        1.603  ops/s
c.s.q.ShaderFunc.testProcessProc    thrpt       40  11.706        0.105  ops/s
With the suggested changes (code here):
Benchmark                            Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcess        thrpt       40  40.890        5.772  ops/s
c.s.q.ShaderFunc.testProcessInline  thrpt       40  44.032        2.389  ops/s
c.s.q.ShaderFunc.testProcessProc    thrpt       40  44.378        2.153  ops/s
This is fantastic performance, well above what I was expecting.
Update 2: After changing the code to more closely resemble my desired API, the overhead of calling through a lambda or class seems to have returned. The code was reworked quite a bit to pull in the suggested changes and clean things up, and I refactored the benchmarks based on some JMH recommendations. Based on the code in this gist, the results are:
Benchmark                             Mode  Samples   Score  Score error  Units
c.s.q.ShaderBench.testProcessInline  thrpt      200  40.860        0.184  ops/s
c.s.q.ShaderBench.testProcessLambda  thrpt      200  22.603        0.159  ops/s
c.s.q.ShaderBench.testProcessProc    thrpt      200  22.792        0.117  ops/s
On a VM on the server type I plan to use:
Benchmark                             Mode  Samples   Score  Score error  Units
c.s.q.ShaderBench.testProcessInline  thrpt      200  40.685        0.224  ops/s
c.s.q.ShaderBench.testProcessLambda  thrpt      200  16.077        0.113  ops/s
c.s.q.ShaderBench.testProcessProc    thrpt      200  23.827        0.088  ops/s
On the cheapest Digital Ocean node:
Benchmark                             Mode  Samples   Score  Score error  Units
c.s.q.ShaderBench.testProcessInline  thrpt      200  24.425        0.506  ops/s
c.s.q.ShaderBench.testProcessLambda  thrpt      200   9.643        0.140  ops/s
c.s.q.ShaderBench.testProcessProc    thrpt      200  13.733        0.134  ops/s
That's acceptable performance across the board (and on my work machine and the server VM, pretty fantastic performance). I'm still interested in figuring out why the call has such significant overhead and what can be done to optimize it. I'm currently experimenting with different sets of parameters, and have posted a more specific question.
Upvotes: 0
Views: 1861
Reputation: 132460
Thanks for providing the gist; it made it quite easy to fiddle around with the code to see the performance effects. Also, +1 for using JMH.
First, here are the baseline results on my machine (2009 MacBook Pro, 2.8GHz Core2Duo, JDK 8u5):
Benchmark                            Mode  Samples  Score  Score error  Units
c.s.q.ShaderFunc.testProcess        thrpt        5  7.191        1.140  ops/s
c.s.q.ShaderFunc.testProcessInline  thrpt        5  7.592        0.465  ops/s
c.s.q.ShaderFunc.testProcessProc    thrpt        5  7.326        1.242  ops/s
(c.s.q is com.stackoverflow.questions)
The differences between the techniques in my runs are smaller, though the errors are somewhat higher, and the inline version is still fastest. Since the results were so close, I started off optimizing testProcess, which is the one that makes a direct call to the blur method, since that's the code you've included here. For other readers' convenience, the code that calls the blur method is this:
int width = 4000;
int height = 2000;
int[][] nextData = new int[width][height];
for (int i = 0; i < width; ++i) {
    for (int j = 0; j < height; ++j) {
        nextData[i][j] = blur(blurData, i, j);
    }
}
My first observation is that there are a lot of conditionals in the blur method that avoid stepping off the edges of the matrix. Conveniently, the accumulations done at the edges have the same result if the value "off the edge" is zero (I think this is true of most image-processing kernels). This means that if we pad the edges of the matrix with zeroes and run the loops from 1 to limit-1 instead of 0 to limit, we can drop the conditionals. The loop changes to this:
int width = 4002;
int height = 2002;
int[][] nextData = new int[width][height];
for (int i = 1; i < width-1; ++i) {
    for (int j = 1; j < height-1; ++j) {
        nextData[i][j] = blur(blurData, i, j);
    }
}
(You also have to make corresponding changes to the randomMatrix function that generates the input data.) If you remove the conditionals from the blur method, it now looks like this:
public int blur(final int[][] data, final int x, final int y) {
    float accumulator = 0;
    int[] col = data[x];
    accumulator += col[y];
    accumulator += data[x-1][y] * 0.5f;
    accumulator += data[x+1][y] * 0.5f;
    accumulator += col[y-1] * 0.5f;
    accumulator += col[y+1] * 0.5f;
    return Math.round(accumulator / 3f);
}
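The corresponding padded-input generation might look roughly like this (my sketch; the actual randomMatrix signature in the gist may differ):

```java
import java.util.Random;

class PaddedMatrix {
    // Builds a (width x height) matrix whose one-cell border stays zero,
    // so the unconditional blur can safely read one cell past the interior.
    // The value range 0..255 is an assumption (e.g. 8-bit pixel data).
    static int[][] randomMatrix(int width, int height, long seed) {
        Random rnd = new Random(seed);
        int[][] data = new int[width][height];
        for (int i = 1; i < width - 1; ++i) {
            for (int j = 1; j < height - 1; ++j) {
                data[i][j] = rnd.nextInt(256);
            }
        }
        return data;  // border cells remain 0 by default initialization
    }
}
```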
The results for this version are maybe 15% faster:
Benchmark                     Mode  Samples  Score  Score error  Units
c.s.q.ShaderFunc.testProcess  thrpt       5  8.424        1.035  ops/s
Now let's take a closer look at the calculations. The input is all int data, but we're accumulating into a float variable, and the output is an int as well. Instead of repeated multiplications by 0.5f, we can accumulate double the amounts and then divide by 6f at the end. (There is the possibility of overflow here, though, if the input data is in the 2 billion range.) With some additional simplifications, the revised code looks like this:
public int blur(final int[][] data, final int x, final int y) {
    int[] col = data[x];
    int accumulator = 2 * col[y]
            + data[x-1][y]
            + data[x+1][y]
            + col[y-1]
            + col[y+1];
    return Math.round(accumulator / 6f);
}
And the results are more than 80% faster!
Benchmark                     Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcess  thrpt       5  15.397        1.897  ops/s
With the simplified blur method, let's reconsider inlining. I won't reproduce the code, since it's just taking the body of the blur method and doing the obvious refactoring of it into the nested for loops above (adjusting variable names, etc.). Doing this gives the following results:
Benchmark                           Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcessInline  thrpt       5  15.619        1.607  ops/s
Just a little bit faster, but within the margin of error, so it's hard to tell for sure. It might not be worth inlining if keeping the functions separate makes it easier to plug in different algorithms.
The big win here is getting rid of floating-point operations, particularly floating-point multiplies. Many multi-core systems have more integer than floating-point hardware available, so avoiding FP on such systems will still help.
Ah, that gives me another idea. Can we get rid of the Math.round call and the FP divide? Again, depending on your input's numeric ranges, we can do integer-based rounding. Instead of
Math.round(accumulator / 6f)
we can do something more-or-less equivalent like:
(1000 * accumulator + 500) / 6000
The results with this change are another 25% improvement!
Benchmark                           Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcessInline  thrpt       5  19.517        2.894  ops/s
Today's lesson: to speed things up, replace floating point with integer computation. Of course, you have to pay attention to overflow and integer truncation issues. But if you can make it work, it's worth it.
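As an aside (my addition, not part of the original answer): for a non-negative accumulator, an exactly equivalent integer rounding of accumulator / 6 is (accumulator + 3) / 6, which also drops the extra multiply. The (1000 * accumulator + 500) / 6000 form is only "more-or-less" equivalent, as the answer says; for example, accumulator = 5 yields 0, while Math.round(5 / 6f) is 1.

```java
class IntRounding {
    // Half-up rounding of a / 6 for non-negative a, without floating point.
    // Matches Math.round(a / 6f) for any non-negative a in a sane pixel range.
    static int roundDiv6(int a) {
        return (a + 3) / 6;
    }
}
```

(For negative accumulators, Java's truncating integer division would break this, so it relies on the zero-padded, non-negative input data.)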
UPDATE
In response to a suggestion from Aleksey Shipilev (author of JMH), I've run an alternative version that multiplies by the reciprocal of 6.0f instead of dividing. The code is:
static final float ONE_SIXTH = 1.0f / 6.0f;
...
Math.round(accumulator * ONE_SIXTH);
This replaces a floating-point division with a floating-point multiplication. The results are:
Benchmark                           Mode  Samples   Score  Score error  Units
c.s.q.ShaderFunc.testProcessInline  thrpt       5  17.144        0.869  ops/s
Noticeably faster than the FP division, but not quite as fast as the integer computation.
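Putting the pieces together, the end state of the optimizations above (zero-padded borders, integer accumulation, a single rounded divide per cell) can be sketched as one inlined pass. This is my consolidation of the steps, not the answer's verbatim code:

```java
class BlurPass {
    // One full blur pass over a zero-padded matrix: interior cells only,
    // integer accumulation, one rounded division per cell.
    static int[][] blurPass(int[][] data) {
        int width = data.length;
        int height = data[0].length;
        int[][] next = new int[width][height];
        for (int x = 1; x < width - 1; ++x) {
            int[] col = data[x];
            int[] left = data[x - 1];
            int[] right = data[x + 1];
            for (int y = 1; y < height - 1; ++y) {
                int acc = 2 * col[y] + left[y] + right[y] + col[y - 1] + col[y + 1];
                next[x][y] = Math.round(acc / 6f);
            }
        }
        return next;  // border stays zero, ready for the next pass
    }
}
```

Hoisting the three column references out of the inner loop is my own tweak; the JIT may well do this on its own, so benchmark before assuming it helps.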
Upvotes: 2