Perraco
Perraco

Reputation: 17360

Optimize a nearest neighbor resizing algorithm for speed

I'm using the next algorithm to perform nearest neighbor resizing. Is there anyway to optimize it's speed? Input and Output buffers are in ARGB format, though images are known to be always opaque. Thank you.

void resizeNearestNeighbor(const uint8_t* input, uint8_t* output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight) ;
    const int colors = 4;

    for (int y = 0; y < targetHeight; y++)
    {
        int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
        int i_xdest = y * targetWidth;

        for (int x = 0; x < targetWidth; x++)
        {
            int x2 = ((x * x_ratio) >> 16) ;
            int y2_x2_colors = (y2_xsource + x2) * colors;
            int i_x_colors = (i_xdest + x) * colors;

            output[i_x_colors]     = input[y2_x2_colors];
            output[i_x_colors + 1] = input[y2_x2_colors + 1];
            output[i_x_colors + 2] = input[y2_x2_colors + 2];
            output[i_x_colors + 3] = input[y2_x2_colors + 3];
        }
    }
}

Upvotes: 0

Views: 1897

Answers (4)

Moby Disk
Moby Disk

Reputation: 3861

I hope I didn't break anything. This combines some of the suggestions posted thus far and is about 30% faster. I'm amazed that is all we got. I did not actually check the destination image to see if it was right.

Changes: - remove multiplies from inner loop (10% improvement) - uint32_t instead of uint8_t (10% improvement) - __restrict keyword (1% improvement)

This was on an i7 x64 machine running Windows, compiled with MSVC 2013. You will have to change the __restrict keyword for other compilers.

void resizeNearestNeighbor2_32(const uint8_t* __restrict input, uint8_t* __restrict output, int sourceWidth, int sourceHeight, int targetWidth, int targetHeight)
{
    const uint32_t* input32 = (const uint32_t*)input;
    uint32_t* output32 = (uint32_t*)output;

    const int x_ratio = (int)((sourceWidth << 16) / targetWidth);
    const int y_ratio = (int)((sourceHeight << 16) / targetHeight);

    int x_ratio_with_color = x_ratio;

    for (int y = 0; y < targetHeight; y++)
    {
        int y2_xsource = ((y * y_ratio) >> 16) * sourceWidth;
        int i_xdest = y * targetWidth;

        int source_x_offset = 0;
        int startingOffset = y2_xsource;
        const uint32_t * inputLine = input32 + startingOffset;
        for (int x = 0; x < targetWidth; x++)
        {
            i_xdest += 1;
            source_x_offset += x_ratio_with_color;
            int sourceOffset = source_x_offset >> 16;

            output[i_xdest] = inputLine[sourceOffset];
        }
    }
}

Upvotes: 0

Diniden
Diniden

Reputation: 1125

The algorithm is fine, but you can utilize massive parallelization by submitting your image to the GPU. If you use opengl, simply creating a context of the new size and providing a properly sized quad can give you inherent nearest neighbor calculations. Also opengl could give you access to other resizing sampling techniques by simply changing the properties of the texture you read from (which would amount to a single gl command which could be an easy paramter to your resize function).

Also later in development, you could simply swap out a shader for other blending techniques which also keeps you utilizing your wonderful GPU processor of image processing glory.

Also, since you aren't using any fancy geometry it can become almost trivial to write the program. It would be a little more involved than your algorithm, but it could perform magnitudes faster depending on image size.

Upvotes: 0

user1196549
user1196549

Reputation:

There's little that you can do to speed this up, as you already arranged the loops in the right order and cleverly used fixed-point arithmetic. As others suggested, try to move the 32 bits in a single go (hoping that the compiler didn't see that yet).

In case of significant enlargement, there is a possibility: you can determine how many times every source pixel needs to be replicated (you'll need to work on the properties of the relation Xd=Wd.Xs/Ws in integers), and perform a single pixel read for k writes. This also works on the y's, and you can memcpy the identical rows instead of recomputing them. You can precompute and tabulate the mappings of the X's and Y's using run-length coding.

But there is a barrier that you will not pass: you need to fill the destination image.

If you are desperately looking for speedup, there could remain the option of using vector operations (SEE or AVX) to handle several pixels at a time. Shuffle instructions are available that might enable to control the replication (or decimation) of the pixels. But due to the complicated replication pattern combined with the fixed structure of the vector registers, you will probably need to integrate a complex decision table.

Upvotes: 1

user3528438
user3528438

Reputation: 2817

restrict keyword will help a lot, assuming no aliasing.

Another improvement is to declare another pointerToOutput and pointerToInput as uint_32_t, so that the four 8-bit copy-assignments can be combined into a 32-bit one, assuming pointers are 32bit aligned.

Upvotes: 1

Related Questions