Jorjon

Reputation: 5434

Convert BYTE buffer (0-255) to float buffer (0.0-1.0)

How can I convert a BYTE buffer (values from 0 to 255) to a float buffer (values from 0.0 to 1.0)? The mapping should be linear, e.g. 0 in the byte buffer becomes 0.0f in the float buffer, 128 becomes roughly 0.5f, and 255 becomes 1.0f.

This is the code that I have at the moment:

for (int y=0;y<height;y++) {
    for (int x=0;x<width;x++) {
        float* floatpixel = floatbuffer + (y * width + x) * 4;
        BYTE* bytepixel = (bytebuffer + (y * width + x) * 4);
        floatpixel[0] = bytepixel[0]/255.f;
        floatpixel[1] = bytepixel[1]/255.f;
        floatpixel[2] = bytepixel[2]/255.f;
        floatpixel[3] = 1.0f; // A
    }
}

This runs very slowly. A friend of mine suggested using a conversion table, but I wanted to know whether someone can suggest another approach.

Upvotes: 7

Views: 8996

Answers (7)

sam hocevar

Reputation: 12129

I know this is an old question, but since no one gave a solution using the IEEE float representation, here is one.

// Use three unions instead of one to avoid pipeline stalls
union { float f; uint32_t i; } t, u, v, w;
t.f = 32768.0f;
float const b = 256.f / 255.f;

for(int size = width * height; size > 0; --size)
{
    u.i = t.i | bytepixel[0]; floatpixel[0] = (u.f - t.f) * b;
    v.i = t.i | bytepixel[1]; floatpixel[1] = (v.f - t.f) * b;
    w.i = t.i | bytepixel[2]; floatpixel[2] = (w.f - t.f) * b;
    floatpixel[3] = 1.0f; // A
    floatpixel += 4;
    bytepixel += 4;
}

This is more than twice as fast as an int to float conversion on my computer (Core 2 Duo CPU).
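For anyone wondering why the bit trick works, here is a minimal single-byte sketch of the same idea. 32768.0f has the bit pattern 0x47000000, and at that magnitude one unit in the last place of a float is 2^-8, so ORing a byte into the low mantissa bits gives exactly 32768 + byte/256. The helper name `byte_to_unit_float` is made up for illustration, IEEE 754 single-precision floats are assumed, and `memcpy` stands in for the union so the sketch does not depend on union type punning.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the answer above): converts one byte
 * to [0.0, 1.0] via the float bit pattern instead of an int-to-float
 * conversion. Assumes IEEE 754 single-precision floats. */
float byte_to_unit_float(uint8_t b)
{
    const float bias = 32768.0f;          /* bit pattern 0x47000000 */
    const float scale = 256.0f / 255.0f;
    uint32_t bits;
    float f;

    assert(sizeof(float) == sizeof(uint32_t));

    memcpy(&bits, &bias, sizeof bits);    /* reinterpret the float as an integer */
    bits |= b;                            /* low 8 mantissa bits now hold b * 2^-8 */
    memcpy(&f, &bits, sizeof f);          /* f == 32768 + b / 256, exactly */

    return (f - bias) * scale;            /* (b / 256) * (256 / 255) == b / 255 */
}
```

The subtraction is exact, so the only rounding happens in the final multiplication.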

Here is an SSE3 version of the loop above that produces 16 floats per iteration (all of the intrinsics it uses are already available in SSE2). It requires bytepixel and floatpixel to be 128-bit aligned, and the total number of pixels to be a multiple of 4. Note that the built-in SSE int-to-float conversions will not help much here, as they would still require an additional multiplication. I believe this is the shortest way to go instruction-wise, but if your compiler isn't clever enough you may wish to unroll and schedule things by hand.

/* Magic values */
__m128i zero = _mm_set_epi32(0, 0, 0, 0);
__m128i magic1 = _mm_set_epi32(0xff000000, 0xff000000, 0xff000000, 0xff000000);
__m128i magic2 = _mm_set_epi32(0x47004700, 0x47004700, 0x47004700, 0x47004700);
__m128 magic3 = _mm_set_ps(32768.0f, 32768.0f, 32768.0f, 32768.0f);
__m128 magic4 = _mm_set_ps(256.0f / 255.0f, 256.0f / 255.0f, 256.0f / 255.0f, 256.0f / 255.0f);

for(int size = width * height / 4; size > 0; --size)
{
    /* Load bytes in vector and force alpha value to 255 so that
     * the output will be 1.0f as expected. */
    __m128i in = _mm_load_si128((__m128i *)bytepixel);
    in = _mm_or_si128(in, magic1);

    /* Shuffle bytes into four ints ORed with 32768.0f and cast
     * to float (the cast is free). */
    __m128i tmplo = _mm_unpacklo_epi8(in, zero);
    __m128i tmphi = _mm_unpackhi_epi8(in, zero);
    __m128 in1 = _mm_castsi128_ps(_mm_unpacklo_epi16(tmplo, magic2));
    __m128 in2 = _mm_castsi128_ps(_mm_unpackhi_epi16(tmplo, magic2));
    __m128 in3 = _mm_castsi128_ps(_mm_unpacklo_epi16(tmphi, magic2));
    __m128 in4 = _mm_castsi128_ps(_mm_unpackhi_epi16(tmphi, magic2));

    /* Subtract 32768.0f and multiply by 256.0f/255.0f */
    __m128 out1 = _mm_mul_ps(_mm_sub_ps(in1, magic3), magic4);
    __m128 out2 = _mm_mul_ps(_mm_sub_ps(in2, magic3), magic4);
    __m128 out3 = _mm_mul_ps(_mm_sub_ps(in3, magic3), magic4);
    __m128 out4 = _mm_mul_ps(_mm_sub_ps(in4, magic3), magic4);

    /* Store 16 floats */
    _mm_store_ps(floatpixel, out1);
    _mm_store_ps(floatpixel + 4, out2);
    _mm_store_ps(floatpixel + 8, out3);
    _mm_store_ps(floatpixel + 12, out4);

    floatpixel += 16;
    bytepixel += 16;
}

Edit: improve accuracy by using (f + c/b) * b instead of f * b + c.

Edit: add SSE3 version.

Upvotes: 8

Viet

Reputation: 18414

A look-up table is the fastest way to convert :) Here you go:

Python code to generate the byte_to_float.h file to include:

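#!/usr/bin/env python3

def main():
    # Emit a C array holding ii/255.0 for every byte value 0..255.
    print("static const float byte_to_float[] = {")

    for ii in range(0, 255):
        print("%sf," % (ii / 255.0))

    # The last entry (255) is written out by hand as exactly 1.0f.
    print("1.0f };")
    return 0

if __name__ == "__main__":
    main()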

And C++ code to get the conversion:

floatpixel[0] = byte_to_float[ bytepixel[0] ];

Simple, isn't it?

Upvotes: 1

Rodyland

Reputation: 558

Don't divide by 255 every time. I don't know whether a compiler will be smart enough to optimize this away. Calculate 1/255 once and multiply by it every time. Even better, define it as a constant.
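As a rough sketch of that advice: compilers generally will not turn `x / 255.f` into a multiplication on their own unless they are told to relax floating-point rules (for example with a fast-math style option), because 1/255 is not exactly representable and the result can differ in the last bit. Hoisting the reciprocal by hand could look like this; the names `INV_255` and `scale_bytes` are illustrative and not from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reciprocal evaluated once, as a compile-time constant. */
static const float INV_255 = 1.0f / 255.0f;

/* Hypothetical helper: scales n bytes to floats in [0.0, 1.0] using
 * one multiplication per element instead of one division. */
void scale_bytes(const uint8_t *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * INV_255;
}
```

Unlike the loop in the question, this scales every byte it is given, so the alpha channel would need separate handling.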

Upvotes: 0

moonshadow

Reputation: 89145

Whether you choose to use a lookup table or not, your code is doing a lot of work each loop iteration that it really does not need to - likely enough to overshadow the cost of the convert and multiply.

Declare your pointers restrict, and pointers you only read from const. Multiply by 1/255th instead of dividing by 255. Don't calculate the pointers in each iteration of the inner loop, just calculate initial values and increment them. Unroll the inner loop a few times. Use vector SIMD operations if your target supports it. Don't increment and compare with maximum, decrement and compare with zero instead.

Something like

float* restrict floatpixel = floatbuffer;
BYTE const* restrict bytepixel = bytebuffer;
for( int size = width*height; size > 0; --size )
{
    floatpixel[0] = bytepixel[0]*(1.f/255.f);
    floatpixel[1] = bytepixel[1]*(1.f/255.f);
    floatpixel[2] = bytepixel[2]*(1.f/255.f);
    floatpixel[3] = 1.0f; // A
    floatpixel += 4;
    bytepixel += 4;
}

would be a start.

Upvotes: 9

xtofl

Reputation: 41519

You need to find out what the bottleneck is:

  • if you iterate your data tables in the 'wrong' direction, you constantly hit a cache miss. No lookup will ever help get around that.
  • if your processor is slower at scaling than at looking up, you can boost performance with a lookup table, provided the table fits in its cache.
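One way to find the bottleneck is simply to time each candidate on a buffer the size of the real image. The following is a minimal sketch using the standard `clock()` function; the `convert_fn` signature and the names `convert_divide` and `seconds_per_call` are assumptions made for this example, not something from the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the routines being compared (an assumption of
 * this sketch). `pixels` is the number of RGBA pixels. */
typedef void (*convert_fn)(const uint8_t *src, float *dst, size_t pixels);

/* Straightforward reference conversion, equivalent to the question's loop. */
void convert_divide(const uint8_t *src, float *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; ++i, src += 4, dst += 4) {
        dst[0] = src[0] / 255.0f;
        dst[1] = src[1] / 255.0f;
        dst[2] = src[2] / 255.0f;
        dst[3] = 1.0f; /* A */
    }
}

/* Returns the average CPU time in seconds for one call of fn,
 * measured over `runs` consecutive calls. */
double seconds_per_call(convert_fn fn, const uint8_t *src, float *dst,
                        size_t pixels, int runs)
{
    assert(fn != NULL && runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn(src, dst, pixels);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Passing each alternative (table lookup, reciprocal multiply, and so on) through `seconds_per_call` with the same buffers gives numbers that can be compared directly. Small test buffers fit entirely in cache and hide the memory effects described above, so measure with realistic sizes.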

Another tip:

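struct Scale {
    // Maps a byte in [0, 255] to a float in [0.0, 1.0].
    float operator()( const BYTE b ) const { return b * (1.f/255.f); }
};
// std::transform is declared in <algorithm>; itssize is the number of bytes.
// Note that this scales every byte, alpha included.
std::transform( bytebuffer, bytebuffer + itssize, floatpixel, Scale() );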

Upvotes: 2

laalto

Reputation: 152887

Yes, a lookup table is definitely faster than doing a lot of divisions in a loop. Just generate a table of 256 precomputed float values and use the byte value to index that table.

You can also optimize the loop a little by removing the index computation and just do something like

float *floatpixel = floatbuffer;
BYTE *bytepixel = bytebuffer;

for (...) {
  *floatpixel++ = float_table[*bytepixel++];
  *floatpixel++ = float_table[*bytepixel++];
  *floatpixel++ = float_table[*bytepixel++];
  *floatpixel++ = 1.0f;
  bytepixel++;  // skip the source alpha byte (4 bytes per pixel)
}
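For completeness, here is a hedged sketch of how the table could be filled at startup and the loop wrapped in a function. `float_table` is the name used above; `init_float_table`, `convert_with_table` and the `pixels` parameter are made up for this example, and the source is assumed to hold 4 bytes per pixel as in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 256 floats = 1 KiB, small enough to stay in L1 cache on typical CPUs. */
static float float_table[256];

/* Fill the table once, before the first conversion. */
void init_float_table(void)
{
    for (int i = 0; i < 256; ++i)
        float_table[i] = i / 255.0f;
}

/* Hypothetical wrapper: converts `pixels` RGBA pixels, forcing the
 * output alpha to 1.0f as the question's code does. */
void convert_with_table(const uint8_t *bytepixel, float *floatpixel,
                        size_t pixels)
{
    assert(bytepixel != NULL && floatpixel != NULL);
    for (size_t i = 0; i < pixels; ++i) {
        *floatpixel++ = float_table[*bytepixel++];
        *floatpixel++ = float_table[*bytepixel++];
        *floatpixel++ = float_table[*bytepixel++];
        *floatpixel++ = 1.0f;  /* A */
        bytepixel++;           /* skip the source alpha byte */
    }
}
```

Call `init_float_table()` once at startup, then `convert_with_table(bytebuffer, floatbuffer, width * height)` for each image.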

Upvotes: 1

Mats Fredriksson

Reputation: 20101

Use a static lookup table for this. When I worked at a computer graphics company, we ended up with a hard-coded lookup table for this that we linked into the project.

Upvotes: 2
