Reputation: 123

How to most efficiently modify R / G / B values?

So I wanted to implement lighting in my pixel-based rendering system. After some googling I found that to display R / G / B values lighter or darker, I have to multiply each red, green and blue value by a number < 1 to make it darker, or by a number > 1 to make it lighter.

So I implemented it like this, but it's really dragging down my performance, since I have to do this for each pixel:

void PixelRenderer::applyLight(Uint32& color){
    Uint32 alpha = color >> 24;
    alpha << 24;
    alpha >> 24;

    Uint32 red = color >> 16;
    red = red << 24;
    red = red >> 24;

    Uint32 green = color >> 8;
    green = green << 24;
    green = green >> 24;

    Uint32 blue = color;
    blue = blue << 24;
    blue = blue >> 24;

    red = red * 0.5;
    green = green * 0.5;
    blue = blue * 0.5;
    color = alpha << 24 | red << 16 | green << 8 | blue;
}

Any ideas or examples on how to improve the speed?

Upvotes: 3

Answers (5)

Persixty

Reputation: 8589

To halve each channel while preserving the alpha value in the top byte, use:

(color>>1)&0x7F7F7F | (color&0xFF000000)

(A tweak on what Wimmel offered in the comments).

I think the 'learning curve' here is that you were using shift and shift back to mask out bits. You should use & with a masking value.
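As a rough sketch of the masking idea (the name halveRGB is made up for illustration, and the 0xAARRGGBB layout is assumed from the question's code), halving all three channels in one step might look like this:

```cpp
#include <cstdint>

// Hypothetical helper: halves R, G and B of a 0xAARRGGBB pixel, keeps alpha.
inline std::uint32_t halveRGB(std::uint32_t color) {
    // Shift everything right by one, then clear the bit that leaked
    // from each channel into the top of its lower neighbour.
    std::uint32_t rgb = (color >> 1) & 0x007F7F7Fu;
    // Put the original, unshifted alpha byte back.
    return rgb | (color & 0xFF000000u);
}
```

For instance, halveRGB(0xFFFFFFFF) yields 0xFF7F7F7F: alpha stays at 255 and each channel drops from 255 to 127.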

For a more general solution (where 0.0<=factor<=1.0) :

void PixelRenderer::applyLight(Uint32& color, double factor){
    Uint32 alpha=color&0xFF000000;
    Uint32 red=  (color&0x00FF0000)*factor;
    Uint32 green= (color&0x0000FF00)*factor;
    Uint32 blue=(color&0x000000FF)*factor;

    color=alpha|(red&0x00FF0000)|(green&0x0000FF00)|(blue&0x000000FF);
}

Notice there is no need to shift the components down to the low order bits before performing the multiplication.

Ultimately you may find that the bottleneck is floating point conversions and arithmetic.

To reduce that, consider one of the following:

Reduce the factor to an integer scaling factor, for example in the range 0-256.
Precompute factor*component as a 256-element array and 'pick' the components out of it.

I'm proposing the range 0-256 (257 values) because 256 then represents a factor of exactly 1.0, and the division by 256 becomes a right shift by 8.

For this integer version (where 0<=factor<=256):

void PixelRenderer::applyLight(Uint32& color, Uint32 factor){
    Uint32 alpha=color&0xFF000000;
    Uint32 red=  ((color&0x00FF0000)*factor)>>8;
    Uint32 green= ((color&0x0000FF00)*factor)>>8;
    Uint32 blue=((color&0x000000FF)*factor)>>8;

    color=alpha|(red&0x00FF0000)|(green&0x0000FF00)|(blue&0x000000FF);
}
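To connect the two versions, here is a minimal sketch of converting a floating-point brightness to the 0-256 scale once (say, per frame) and then using only integer math per pixel. The names toFixedFactor and scaleRGB are hypothetical, and the rounding choice is an assumption rather than part of the answer above:

```cpp
#include <cstdint>

// Hypothetical helper: map a brightness in [0.0, 1.0] to the 0..256 scale,
// where 256 stands for 1.0. Values outside the range are clamped.
inline std::uint32_t toFixedFactor(double factor) {
    if (factor <= 0.0) return 0;
    if (factor >= 1.0) return 256;
    return static_cast<std::uint32_t>(factor * 256.0 + 0.5);  // round to nearest
}

// Integer-only scaling of a 0xAARRGGBB pixel; alpha is left untouched.
inline std::uint32_t scaleRGB(std::uint32_t color, std::uint32_t factor) {
    std::uint32_t alpha = color & 0xFF000000u;
    std::uint32_t red   = ((color & 0x00FF0000u) * factor) >> 8;
    std::uint32_t green = ((color & 0x0000FF00u) * factor) >> 8;
    std::uint32_t blue  = ((color & 0x000000FFu) * factor) >> 8;
    return alpha | (red & 0x00FF0000u) | (green & 0x0000FF00u) | (blue & 0x000000FFu);
}
```

The floating-point conversion then happens once, outside the pixel loop, for example scaleRGB(pixel, toFixedFactor(0.5)).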

Here's a runnable program illustrating the first (floating-point) version:

#include <stdio.h>
#include <inttypes.h>

typedef uint32_t Uint32;

Uint32 make(Uint32 alpha,Uint32 red,Uint32 green,Uint32 blue){
    return (alpha<<24)|(red<<16)|(green<<8)|blue;
}

void output(Uint32 color){
    printf("alpha=%" PRIu32 " red=%" PRIu32 " green=%" PRIu32 " blue=%" PRIu32 "\n",
           color>>24, (color&0xFF0000)>>16, (color&0xFF00)>>8, color&0xFF);
}

Uint32 applyLight(Uint32 color, double factor){
    Uint32 alpha=color&0xFF000000;
    Uint32 red=  (color&0x00FF0000)*factor;
    Uint32 green= (color&0x0000FF00)*factor;
    Uint32 blue=(color&0x000000FF)*factor;

    return alpha|(red&0x00FF0000)|(green&0x0000FF00)|(blue&0x000000FF);
}

int main(void) {
    Uint32 color1=make(156,100,50,20);
    Uint32 result1=applyLight(color1,0.9);
    output(result1);

    Uint32 color2=make(255,255,255,255);
    Uint32 result2=applyLight(color2,0.1);
    output(result2);

    Uint32 color3=make(78,220,200,100);
    Uint32 result3=applyLight(color3,0.05);
    output(result3);

    return 0;
}

Expected Output is:

alpha=156 red=90 green=45 blue=18
alpha=255 red=25 green=25 blue=25
alpha=78 red=11 green=10 blue=5

Upvotes: 2

user1118321

Reputation: 26395

One thing that I don't see anyone else mentioning is parallelizing your code. There are at least 2 ways to do this: SIMD instructions, and multiple threads.

SIMD instructions (like SSE, AVX, etc.) perform the same math on multiple pieces of data at the same time. So you could, for example, multiply the red, green, blue, and alpha of a pixel by a vector of four factors in one instruction, like this (pseudocode; vec4 and vec_Mult are placeholders rather than a real API):

vec4 lightValue = vec4(0.5, 0.5, 0.5, 1.0);
vec4 result = vec_Mult(inputPixel, lightValue);

That's the equivalent of:

lightValue.red = 0.5;
lightValue.green = 0.5;
lightValue.blue = 0.5;
lightValue.alpha = 1.0;

result.red = inputPixel.red * lightValue.red;
result.green = inputPixel.green * lightValue.green;
result.blue = inputPixel.blue * lightValue.blue;
result.alpha = inputPixel.alpha * lightValue.alpha;

You can also cut your image into tiles and perform the lightening operation on several tiles at once using threads run on multiple cores. If you're using C++11, you can use std::thread to start multiple threads. Otherwise your OS probably has functionality for threading, such as WinThreads, Grand Central Dispatch, pthreads, boost threads, Threading Building Blocks, etc.

You can combine both of the above and have multithreaded code that operates on whole pixels at a time.
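A minimal sketch of the threaded approach using std::thread follows. It assumes the image is a flat buffer of 0xAARRGGBB pixels and splits it into contiguous slices rather than 2D tiles for simplicity. The names darkenSlice and darkenParallel are hypothetical, and the factor uses a 0..256 integer scale (256 meaning 1.0):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Darken one contiguous slice of 0xAARRGGBB pixels (factor is 0..256, 256 = 1.0).
void darkenSlice(std::uint32_t* px, std::size_t count, std::uint32_t factor) {
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t c = px[i];
        std::uint32_t r = (((c >> 16) & 0xFFu) * factor) >> 8;
        std::uint32_t g = (((c >> 8) & 0xFFu) * factor) >> 8;
        std::uint32_t b = ((c & 0xFFu) * factor) >> 8;
        px[i] = (c & 0xFF000000u) | (r << 16) | (g << 8) | b;
    }
}

// Hypothetical driver: split the buffer into roughly equal slices, one per thread.
void darkenParallel(std::vector<std::uint32_t>& pixels, std::uint32_t factor,
                    unsigned threadCount) {
    threadCount = std::max(1u, threadCount);
    const std::size_t chunk = (pixels.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;
    for (std::size_t start = 0; start < pixels.size(); start += chunk) {
        const std::size_t count = std::min(chunk, pixels.size() - start);
        workers.emplace_back(darkenSlice, pixels.data() + start, count, factor);
    }
    for (std::thread& t : workers) t.join();  // wait for every slice to finish
}
```

Because the slices do not overlap, no locking is needed. Starting threads has a cost of its own, so in a real renderer a persistent thread pool would likely be preferable to spawning threads every frame. On some toolchains this needs to be built with -pthread.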

If you want to take it even further, you can do your processing on the GPU of your machine using OpenGL, OpenCL, DirectX, Metal, Mantle, CUDA, or one of the other GPGPU technologies. GPUs generally have hundreds of cores or more that can very quickly process many tiles in parallel, each of which processes whole pixels (rather than just channels) at a time.

But an even better option may be not to write any code at all. It's extremely likely that someone has already done this work and you can leverage it. For example, on macOS there's CoreImage and the Accelerate framework. On iOS you also have CoreImage, and there's also GPUImage. I'm sure there are similar libraries on Windows, Linux, and other OSes you might be working with.

Upvotes: 2

Bids

Reputation: 2449

Shifts and masks like this are generally very fast on a modern processor. I might look at a few other things:

Follow the first rule of optimisation - profile your code. You can do this simply by calling the method millions of times and timing it. Are your calculations slow, or is it something else? What is slow? Try omitting part of the method - do things speed up?
Make sure that this function is declared inline (and make sure it has actually been inlined). The function call overhead will massively outweigh the pixel manipulations (particularly if it is virtual).
Consider declaring your method Uint32 PixelRenderer::applyLight(Uint32 color) and returning the modified value, that may help avoid some dereferences and give the compiler some additional optimisation opportunities.
Avoid floating-point to integer conversions; they can be very expensive. If a plain integer divide is insufficient, look at using fixed-point math.

Finally, look at the assembler to see what the compiler has generated (with optimisations on). Are there any branches or conversions? Has your method actually been inlined?
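The "call it millions of times and time it" suggestion could be sketched with std::chrono as below. The routine under test (applyLightHalf) and the harness name (timeApplyLight) are stand-ins made up for this example; swap in whichever variant you want to measure:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the routine being measured (halves R, G and B, keeps alpha).
inline std::uint32_t applyLightHalf(std::uint32_t color) {
    return ((color >> 1) & 0x007F7F7Fu) | (color & 0xFF000000u);
}

// Hypothetical harness: run the routine over a buffer `passes` times and
// return the elapsed wall-clock time in milliseconds.
double timeApplyLight(std::vector<std::uint32_t>& pixels, int passes) {
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < pixels.size(); ++i)
            pixels[i] = applyLightHalf(pixels[i]);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Measure with optimisations enabled and compare variants on the same buffer size. A single run is noisy, so several repetitions give a more trustworthy picture.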

Upvotes: 3

tofi9

Reputation: 5853

Another solution, without using bit shifts, is to view your 32-bit uint as a struct.
Try to keep your implementation in the .h include file, so that it can be inlined.
If you don't want to have the implementation inlined (see above), modify your applyLight method to accept an array of pixels. Method call overhead can be significant for such a small method.
Enable "loop unroll" optimisation on your compiler, which can help it make use of SIMD instructions.

Implementation:

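#include <algorithm>
#include <cstdint>

class brightness {
private:
    // Field order assumes a little-endian machine and 0xAARRGGBB pixels.
    struct pixel { uint8_t b, g, r, a; };
    float factor;

    // Scale one channel and clamp the result to 0..255.
    static inline void apply(uint8_t& p, float f) {
        p = std::max(std::min(int(p * f), 255), 0);
    }

public:
    brightness(float factor) : factor(factor) { }

    // Scale B, G and R in place; alpha is left untouched.
    void apply(uint32_t& color){
        pixel& p = (pixel&)color;

        apply(p.b, factor);
        apply(p.g, factor);
        apply(p.r, factor);
    }
};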

Implementation with a lookup table (slower when you use "loop unroll"):

class brightness {

    struct pixel { uint8_t b, g, r, a; };

    uint8_t table[256];

public:
    brightness(float factor) {
        for(int i = 0; i < 256; i++)
            table[i] = std::max(std::min(int(i * factor), 255), 0);
    }

    void apply(uint32_t& color){
        pixel& p = (pixel&)color;

        p.b = table[p.b];
        p.g = table[p.g];
        p.r = table[p.r];
    }
};

// usage
brightness half_bright(0.5);
uint32_t pixel = 0xffffffff;
half_bright.apply(pixel);
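The suggestion to accept an array of pixels could be sketched as follows. This is an illustration rather than part of the classes above: the name applyLightBatch is hypothetical, it uses masks instead of the struct view, and it takes an integer factor in the range 0..256 (256 meaning 1.0) instead of a float:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical batch variant: one call processes `count` pixels in place.
// Pixels are assumed to be 0xAARRGGBB; `factor` is 0..256 with 256 meaning 1.0.
void applyLightBatch(std::uint32_t* pixels, std::size_t count, std::uint32_t factor) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint32_t c = pixels[i];
        // Red and blue share one multiply: with factor <= 256 neither product
        // can spill into the other's byte range before the shift.
        const std::uint32_t rb = (((c & 0x00FF00FFu) * factor) >> 8) & 0x00FF00FFu;
        const std::uint32_t g  = (((c & 0x0000FF00u) * factor) >> 8) & 0x0000FF00u;
        pixels[i] = (c & 0xFF000000u) | rb | g;
    }
}
```

With the loop inside the function, the call overhead is paid once per buffer rather than once per pixel, and the compiler sees a simple loop it may be able to unroll or vectorise.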

Upvotes: 1

Mike Nakis

Reputation: 62054

Try this: (EDIT: as it turns out, this is only a readability improvement, but read on for more insights.)

void PixelRenderer::applyLight(Uint32& color)
{
    Uint32 alpha = color >> 24;
    Uint32 red = (color >> 16) & 0xff;
    Uint32 green = (color >> 8) & 0xff;
    Uint32 blue = color & 0xff;
    red = red * 0.5;
    green = green * 0.5;
    blue = blue * 0.5;
    color = alpha << 24 | red << 16 | green << 8 | blue;
}

That having been said, you should understand that performing operations of that sort using a general-purpose processor such as the CPU of your computer is bound to be extremely slow. That's why hardware-accelerated graphics cards were invented.

EDIT

If you insist on operating this way, then you will probably have to resort to hacks in order to improve efficiency. One type of hack which is very often used when dealing with 8-bit channel values is lookup tables. With a lookup table, instead of multiplying each individual channel value by a float, you precompute an array of 256 values where the index into the array is a channel value, and the value at that index is the precomputed result of multiplying the channel value by that float. Then, when converting your image, you just use channel values to look up entries of the array instead of performing actual float multiplication. This is much, much faster. (But still not nearly as fast as programming dedicated, massively parallel hardware to do that stuff for you.)
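A minimal sketch of such a lookup table, applied to a whole buffer with shifts and masks, might look like this. The names makeLightTable and applyLightTable are hypothetical, the 0xAARRGGBB layout is assumed from the question, and the clamping to 255 is an added safeguard for factors above 1:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Build a 256-entry table: table[v] is channel value v scaled by `factor`,
// clamped to the 0..255 range.
std::array<std::uint8_t, 256> makeLightTable(float factor) {
    std::array<std::uint8_t, 256> table{};
    for (int v = 0; v < 256; ++v) {
        int scaled = static_cast<int>(v * factor);
        if (scaled < 0) scaled = 0;
        if (scaled > 255) scaled = 255;
        table[static_cast<std::size_t>(v)] = static_cast<std::uint8_t>(scaled);
    }
    return table;
}

// Replace R, G and B of each 0xAARRGGBB pixel by table lookups; alpha is kept.
void applyLightTable(std::uint32_t* pixels, std::size_t count,
                     const std::array<std::uint8_t, 256>& table) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint32_t c = pixels[i];
        pixels[i] = (c & 0xFF000000u)
                  | (std::uint32_t(table[(c >> 16) & 0xFFu]) << 16)
                  | (std::uint32_t(table[(c >> 8) & 0xFFu]) << 8)
                  |  std::uint32_t(table[c & 0xFFu]);
    }
}
```

The table only needs rebuilding when the light factor changes, so the float multiplications happen 256 times per change instead of three times per pixel.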

EDIT

As others have already pointed out, if you are not planning to operate on the alpha channel, then you do not need to extract it and then later apply it, you can just leave it unaltered. So, you can just do color = (color & 0xff000000) | red << 16 | green << 8 | blue;

Upvotes: 3
