Reputation: 123
So I wanted to implement lighting in my pixel-based rendering system. I googled and found out that to display R / G / B values lighter or darker, I have to multiply each red, green and blue value by a number < 1 to display it darker and by a number > 1 to display it lighter.
So I implemented it like this, but it's really dragging down my performance, since I have to do this for each pixel:
void PixelRenderer::applyLight(Uint32& color){
Uint32 alpha = color >> 24;
alpha << 24;
alpha >> 24;
Uint32 red = color >> 16;
red = red << 24;
red = red >> 24;
Uint32 green = color >> 8;
green = green << 24;
green = green >> 24;
Uint32 blue = color;
blue = blue << 24;
blue = blue >> 24;
red = red * 0.5;
green = green * 0.5;
blue = blue * 0.5;
color = alpha << 24 | red << 16 | green << 8 | blue;
}
Any ideas or examples on how to improve the speed?
Upvotes: 3
Views: 2382
Reputation: 8589
To preserve the alpha value in the front use:
(color>>1)&0x7F7F7F | (color&0xFF000000)
(A tweak on what Wimmel offered in the comments).
I think the 'learning curve' here is that you were using shift and shift back to mask out bits. You should use & with a masking value.
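As a minimal sketch of the masking idea (the function name halveBrightness is hypothetical, not from the original code, and the pixel layout is assumed to be 0xAARRGGBB), the one-liner above can be wrapped like this:

```cpp
#include <cstdint>

// Hypothetical helper: halves R, G and B of a 0xAARRGGBB pixel and keeps alpha.
// Shifting the whole word right by one halves every channel at once, but the
// lowest bit of each channel leaks into the top bit of the channel below it.
// The 0x7F7F7F mask clears those leaked bits (and the shifted alpha bits).
inline std::uint32_t halveBrightness(std::uint32_t color) {
    return ((color >> 1) & 0x007F7F7Fu) | (color & 0xFF000000u);
}
```

For example, halveBrightness(0x9C643214) gives 0x9C32190A: alpha stays at 156 while 100, 50, 20 become 50, 25, 10.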
For a more general solution (where 0.0<=factor<=1.0):
void PixelRenderer::applyLight(Uint32& color, double factor){
Uint32 alpha=color&0xFF000000;
Uint32 red= (color&0x00FF0000)*factor;
Uint32 green= (color&0x0000FF00)*factor;
Uint32 blue=(color&0x000000FF)*factor;
color=alpha|(red&0x00FF0000)|(green&0x0000FF00)|(blue&0x000000FF);
}
Notice there is no need to shift the components down to the low order bits before performing the multiplication.
Ultimately you may find that the bottleneck is floating point conversions and arithmetic.
To reduce that you should consider either:
1. Reducing the factor to an integer scaling factor, for example in the range 0-256.
2. Precomputing factor*component as a 256-element array and 'picking' the components out of it.
I'm proposing a range of 0-256 (257 values) because a factor of 256 then leaves the colour unchanged and the division by 256 becomes a shift by 8.
With an integer factor (where 0<=factor<=256):
void PixelRenderer::applyLight(Uint32& color, Uint32 factor){
Uint32 alpha=color&0xFF000000;
Uint32 red= ((color&0x00FF0000)*factor)>>8;
Uint32 green= ((color&0x0000FF00)*factor)>>8;
Uint32 blue=((color&0x000000FF)*factor)>>8;
color=alpha|(red&0x00FF0000)|(green&0x0000FF00)|(blue&0x000000FF);
}
Here's a runnable program illustrating the first example:
#include <stdio.h>
#include <inttypes.h>
typedef uint32_t Uint32;
Uint32 make(Uint32 alpha,Uint32 red,Uint32 green,Uint32 blue){
return (alpha<<24)|(red<<16)|(green<<8)|blue;
}
void output(Uint32 color){
printf("alpha=%"PRIu32" red=%"PRIu32" green=%"PRIu32" blue=%"PRIu32"\n",(color>>24),(color&0xFF0000)>>16,(color&0xFF00)>>8,color&0xFF);
}
Uint32 applyLight(Uint32 color, double factor){
Uint32 alpha=color&0xFF000000;
Uint32 red= (color&0x00FF0000)*factor;
Uint32 green= (color&0x0000FF00)*factor;
Uint32 blue=(color&0x000000FF)*factor;
return alpha|(red&0x00FF0000)|(green&0x0000FF00)|(blue&0x000000FF);
}
int main(void) {
Uint32 color1=make(156,100,50,20);
Uint32 result1=applyLight(color1,0.9);
output(result1);
Uint32 color2=make(255,255,255,255);
Uint32 result2=applyLight(color2,0.1);
output(result2);
Uint32 color3=make(78,220,200,100);
Uint32 result3=applyLight(color3,0.05);
output(result3);
return 0;
}
Expected Output is:
alpha=156 red=90 green=45 blue=18
alpha=255 red=25 green=25 blue=25
alpha=78 red=11 green=10 blue=5
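The integer-factor variant can be checked in the same way. The following is only a sketch, assuming the 0xAARRGGBB layout used above; the free-function name applyLightFixed is illustrative and not part of PixelRenderer:

```cpp
#include <cassert>
#include <cstdint>

// Packs four 8-bit channels into 0xAARRGGBB (same as make() in the program above).
inline std::uint32_t make(std::uint32_t a, std::uint32_t r,
                          std::uint32_t g, std::uint32_t b) {
    return (a << 24) | (r << 16) | (g << 8) | b;
}

// Scales R, G and B by factor/256 using integer arithmetic only.
// factor must be in 0..256; 256 leaves the colour unchanged.
inline std::uint32_t applyLightFixed(std::uint32_t color, std::uint32_t factor) {
    assert(factor <= 256);  // larger factors would wrap around instead of saturating
    const std::uint32_t alpha = color & 0xFF000000u;
    const std::uint32_t red   = ((color & 0x00FF0000u) * factor) >> 8;
    const std::uint32_t green = ((color & 0x0000FF00u) * factor) >> 8;
    const std::uint32_t blue  = ((color & 0x000000FFu) * factor) >> 8;
    // The masks drop the fractional bits that were shifted into the channel below.
    return alpha | (red & 0x00FF0000u) | (green & 0x0000FF00u) | (blue & 0x000000FFu);
}
```

With these definitions, applyLightFixed(make(156,100,50,20), 128) equals make(156,50,25,10), and a factor of 256 returns the input unchanged.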
Upvotes: 2
Reputation: 26395
One thing that I don't see anyone else mentioning is parallelizing your code. There are at least 2 ways to do this: SIMD instructions, and multiple threads.
SIMD instructions (like SSE, AVX, etc.) perform the same math on multiple pieces of data at the same time. So you could, for example, multiply the red, green, blue, and alpha of a pixel by the same values in 1 instruction, like this:
vec4 lightValue = vec4(0.5, 0.5, 0.5, 1.0);
vec4 result = vec_Mult(inputPixel, lightValue);
That's the equivalent of:
lightValue.red = 0.5;
lightValue.green = 0.5;
lightValue.blue = 0.5;
lightValue.alpha = 1.0;
result.red = inputPixel.red * lightValue.red;
result.green = inputPixel.green * lightValue.green;
result.blue = inputPixel.blue * lightValue.blue;
result.alpha = inputPixel.alpha * lightValue.alpha;
You can also cut your image into tiles and perform the lightening operation on several tiles at once using threads run on multiple cores. If you're using C++11, you can use std::thread to start multiple threads. Otherwise your OS probably has functionality for threading, such as WinThreads, Grand Central Dispatch, pthreads, boost threads, Threading Building Blocks, etc.
You can combine both of the above and have multithreaded code that operates on whole pixels at a time.
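A rough sketch of the threading idea with std::thread follows. It splits a flat pixel buffer into contiguous chunks rather than 2D tiles, and the names halve and darkenParallel are made up for illustration; the per-pixel operation simply halves the colour channels of a 0xAARRGGBB pixel:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-pixel operation: halves R, G and B, keeps alpha (0xAARRGGBB).
inline std::uint32_t halve(std::uint32_t c) {
    return ((c >> 1) & 0x007F7F7Fu) | (c & 0xFF000000u);
}

// Splits the buffer into contiguous chunks and darkens each chunk on its own thread.
// Every thread writes to a disjoint index range, so no locking is needed.
void darkenParallel(std::vector<std::uint32_t>& pixels, unsigned threadCount) {
    threadCount = std::max(1u, threadCount);
    const std::size_t chunk = (pixels.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(pixels.size(), t * chunk);
        const std::size_t end = std::min(pixels.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&pixels, begin, end] {
            for (std::size_t i = begin; i < end; ++i) pixels[i] = halve(pixels[i]);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all chunks to finish
}
```

A reasonable thread count is std::thread::hardware_concurrency(), which may return 0 when the value is unknown; the sketch treats 0 as 1. Starting threads per frame has a cost of its own, so for small images a single thread may well be faster.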
If you want to take it even further, you can do your processing on the GPU of your machine using OpenGL, OpenCL, DirectX, Metal, Mantle, CUDA, or one of the other GPGPU technologies. GPUs generally have hundreds of cores that can very quickly process many tiles in parallel, each of which processes whole pixels (rather than just channels) at a time.
But an even better option may be not to write any code at all. It's extremely likely that someone has already done this work and you can leverage it. For example, on MacOS there's CoreImage and the Accelerate framework. On iOS you also have CoreImage, and there's also GPUImage. I'm sure there are similar libraries on Windows, Linux, and other OSes you might be working with.
Upvotes: 2
Reputation: 2449
Shifts and masks like this are generally very fast on a modern processor. I might look at a few other things:
Consider changing the signature to Uint32 PixelRenderer::applyLight(Uint32 color) and returning the modified value; that may help avoid some dereferences and give the compiler some additional optimisation opportunities. Finally, look at the assembler to see what the compiler has generated (with optimisations on). Are there any branches or conversions? Has your method actually been inlined?
Upvotes: 3
Reputation: 5853
Convert the 32 bits uint into a struct.
Put the method in the .h include file, so that it can be inlined.
Change the applyLight method to accept an array of pixels. Method call overhead can be significant for such a small method.
Implementation:
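#include <algorithm> // std::min, std::max
#include <stdint.h>  // uint8_t, uint32_t
class brightness {
private:
// Byte order assumes a little-endian machine and 0xAARRGGBB pixels.
struct pixel { uint8_t b, g, r, a; };
float factor;
// Scales one channel and clamps the result to 0..255.
static inline void apply(uint8_t& p, float f) {
p = std::max(std::min(int(p * f), 255), 0);
}
public:
brightness(float factor) : factor(factor) { }
// Scales blue, green and red in place; alpha is left untouched.
void apply(uint32_t& color){
pixel& p = (pixel&)color;
apply(p.b, factor);
apply(p.g, factor);
apply(p.r, factor);
}
};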
Implementation with a lookup table (slower when you use "loop unroll"):
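#include <algorithm> // std::min, std::max
#include <stdint.h>  // uint8_t, uint32_t
class brightness {
// Byte order assumes a little-endian machine and 0xAARRGGBB pixels.
struct pixel { uint8_t b, g, r, a; };
uint8_t table[256];
public:
// Precomputes the scaled, clamped value for every possible channel value.
brightness(float factor) {
for(int i = 0; i < 256; i++)
table[i] = std::max(std::min(int(i * factor), 255), 0);
}
// Replaces each colour channel by its table entry; alpha is left untouched.
void apply(uint32_t& color){
pixel& p = (pixel&)color;
p.b = table[p.b];
p.g = table[p.g];
p.r = table[p.r];
}
};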
// usage
brightness half_bright(0.5);
uint32_t pixel = 0xffffffff;
half_bright.apply(pixel);
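The suggestion to accept an array of pixels could look roughly like the sketch below. The names scaleChannel and applyLightToBuffer are hypothetical, and shifts and masks are used instead of the struct cast so that the code does not depend on byte order (a 0xAARRGGBB layout is assumed):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Scales one 8-bit channel value and clamps the result to 0..255.
inline std::uint32_t scaleChannel(std::uint32_t value, float factor) {
    const int scaled = static_cast<int>(value * factor);
    return static_cast<std::uint32_t>(std::min(std::max(scaled, 0), 255));
}

// Hypothetical batch version: one call processes the whole buffer, so any
// per-call overhead is paid once instead of once per pixel.
inline void applyLightToBuffer(std::uint32_t* pixels, std::size_t count, float factor) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint32_t c = pixels[i];
        pixels[i] = (c & 0xFF000000u)                                  // alpha unchanged
                  | (scaleChannel((c >> 16) & 0xFFu, factor) << 16)    // red
                  | (scaleChannel((c >> 8) & 0xFFu, factor) << 8)      // green
                  | scaleChannel(c & 0xFFu, factor);                   // blue
    }
}
```

Called as applyLightToBuffer(buffer, width * height, 0.5f), this darkens every pixel in one pass; factors above 1.0 brighten and saturate at 255.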
Upvotes: 1
Reputation: 62054
Try this: (EDIT: as it turns out, this is only a readability improvement, but read on for more insights.)
void PixelRenderer::applyLight(Uint32& color)
{
Uint32 alpha = color >> 24;
Uint32 red = (color >> 16) & 0xff;
Uint32 green = (color >> 8) & 0xff;
Uint32 blue = color & 0xff;
red = red * 0.5;
green = green * 0.5;
blue = blue * 0.5;
color = alpha << 24 | red << 16 | green << 8 | blue;
}
That having been said, you should understand that performing operations of that sort using a general-purpose processor such as the CPU of your computer is bound to be extremely slow. That's why hardware-accelerated graphics cards were invented.
EDIT
If you insist on operating this way, then you will probably have to resort to hacks in order to improve efficiency. One type of hack which is very often used when dealing with 8-bit channel values is lookup tables. With a lookup table, instead of multiplying each individual channel value by a float, you precompute an array of 256 values where the index into the array is a channel value, and the value at that index is the precomputed result of multiplying the channel value by that float. Then, when converting your image, you just use channel values to look up entries of the array instead of performing actual float multiplication. This is much, much faster. (But still not nearly as fast as programming dedicated, massively parallel hardware to do that stuff for you.)
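A minimal sketch of such a lookup table, assuming 0xAARRGGBB pixels; the names makeLightTable and applyLightTable are invented for this example:

```cpp
#include <array>
#include <cstdint>

// Builds a 256-entry table where table[v] is v scaled by factor, clamped to 0..255.
// This is the only place where floating point arithmetic happens.
inline std::array<std::uint8_t, 256> makeLightTable(float factor) {
    std::array<std::uint8_t, 256> table{};
    for (int v = 0; v < 256; ++v) {
        int scaled = static_cast<int>(v * factor);
        if (scaled < 0) scaled = 0;
        if (scaled > 255) scaled = 255;
        table[v] = static_cast<std::uint8_t>(scaled);
    }
    return table;
}

// Replaces each colour channel by its table entry; alpha is left unaltered.
inline std::uint32_t applyLightTable(std::uint32_t color,
                                     const std::array<std::uint8_t, 256>& table) {
    const std::uint32_t red   = table[(color >> 16) & 0xFFu];
    const std::uint32_t green = table[(color >> 8) & 0xFFu];
    const std::uint32_t blue  = table[color & 0xFFu];
    return (color & 0xFF000000u) | (red << 16) | (green << 8) | blue;
}
```

The table is built once per light level (for example makeLightTable(0.5f)) and then reused for every pixel. Whether it actually beats a plain integer multiply depends on the compiler and the cache behaviour, so it is worth measuring.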
EDIT
As others have already pointed out, if you are not planning to operate on the alpha channel, then you do not need to extract it and then later apply it, you can just leave it unaltered. So, you can just do color = (color & 0xff000000) | red << 16 | green << 8 | blue;
Upvotes: 3