Reputation: 1278
I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.
For performance, the table is unsigned integers for every pixel.
When I attach my profiler, I am showing that my largest performance bottleneck occurs when performing the x-pass.
The simple math expression for the computation is:
sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.
In this case, sat_
is a 1-D pointer of unsigned integers representing the SAT, and buff_
is an 8-bit unsigned monochrome buffer.
My implementation looks like the following:
uint *pSat = sat_;
char *pBuff = buff_;
for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
uint curr = 0;
for (uint x = 0; x < width; x += 4)
{
pSat[x + 0] = curr += pBuff[x + 0];
pSat[x + 1] = curr += pBuff[x + 1];
pSat[x + 2] = curr += pBuff[x + 2];
pSat[x + 3] = curr += pBuff[x + 3];
}
}
The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine is spending an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE's sets, and AVX for any machine this routine will run on, so if there is something there, that would be extremely useful.
Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.
Upvotes: 1
Views: 2423
Reputation: 783
Algorithmically, I don't think there is anything you can do to optimize this further. Even though you didn't use the term OLAP cube in your description, you are basically just building an OLAP cube. The code you have is the standard approach to building an OLAP cube.
If you give details about the hardware you're working with, there might be some optimizations available. For example, there is a GPU programming approach that may or may not be faster. Note: Another post on this thread mentioned that parallelization is not possible. This isn't necessarily true... Your algorithm can't be implemented in parallel, but there are algorithms that maintain data-level parallelism, which could be exploited with a GPU approach.
Upvotes: 1
Reputation: 272657
You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.
But, it sounds like each row is independent of all the others, so you can vectorise/paralellise by computing multiple rows simultaneously. You'd need to transpose your arrays, in order to allow the vector instructions to access neighbouring elements in memory.*
However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.
Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.
Upvotes: 2