user5818733


How is Win32 Bitmap rendering faster than pixels?

Win32 bitmaps are (a lot) faster to draw than calling SetPixelV or a similar per-pixel function. How does this work, if in the end the computer is still drawing pixels for the bitmap?

Upvotes: 7

Views: 1685

Answers (2)

Yakk - Adam Nevraumont

Reputation: 275220

Suppose you have a pixel. This pixel has color components A B and C. The surface you are drawing to has color components X Y and Z.

So first you need to check if they match. If they don't match, costs go up. Assume they match.

Next, you need to do bounds checking -- did the caller give you something stupid? Some comparisons, additions and multiplications.

Next, you need to find where the pixel is in memory. This takes a few multiplications and additions.

Now, you have to access the source data and the destination data and write it.
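As a rough illustration of those per-pixel steps, here is a minimal C sketch. The `Surface` struct and `set_pixel` function are hypothetical names made up for this example; this is not how GDI implements SetPixelV, only a picture of the work that has to be repeated on every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical surface description; not a Win32 structure. */
typedef struct {
    uint8_t *pixels;        /* first byte of the top scanline */
    int width, height;      /* size in pixels */
    size_t bytes_per_pixel; /* e.g. 4 for 32-bit colour */
    size_t stride;          /* bytes from one scanline to the next */
} Surface;

/* Writes one pixel. Returns 0 on success, -1 if the request is rejected. */
int set_pixel(Surface *dst, int x, int y,
              const uint8_t *colour, size_t colour_bytes)
{
    assert(dst != NULL && colour != NULL);

    /* 1. Format check: a mismatch would need a conversion,
          which this sketch does not attempt. */
    if (colour_bytes != dst->bytes_per_pixel)
        return -1;

    /* 2. Bounds check: four comparisons on every call. */
    if (x < 0 || y < 0 || x >= dst->width || y >= dst->height)
        return -1;

    /* 3. Locate the pixel: two multiplications and an addition. */
    size_t offset = (size_t)y * dst->stride
                  + (size_t)x * dst->bytes_per_pixel;

    /* 4. Finally, write the handful of bytes that make up the pixel. */
    memcpy(dst->pixels + offset, colour, colour_bytes);
    return 0;
}
```

Steps 1 to 3 cost more than step 4, and they are paid again for every single pixel.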


If you are working a scanline at a time, almost all of that overhead above can be done once. You can calculate what part of the scanline falls in bounds or not with only a bit more overhead than doing one pixel. You can find where the scanline writes in the destination with again only a bit more overhead than one pixel. You can check color space conversions with the same overhead as one pixel.

The big difference is that instead of copying one pixel, you copy in a block.

As it happens, computers are really good at copying blocks of things. Some CPUs have built-in instructions for it, and some memory systems can do it without the CPU being involved (the CPU says "copy X to Y", then can do other things; and memory-to-memory bandwidth might be higher than memory-to-CPU-to-memory). Even if you are round-tripping through the CPU, there are SIMD instructions that let you work on 2, 4, 8, 16 or even more units of data at the same time, so long as you work on them in the same way using a limited instruction set.
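A sketch of the per-scanline version, again in C and with a hypothetical function name (`copy_scanline`). It assumes source and destination already share the same pixel format. The clipping and address arithmetic happen once, and the pixels themselves go through a single `memcpy`, which the C library is free to implement with whatever block-copy or SIMD instructions the machine offers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies up to `count` pixels from `src` into row `y` of the destination,
   starting at column `x`. Pixels that fall outside the destination are
   clipped. Returns the number of pixels actually written.
   Assumes src and dst use the same pixel format and do not overlap. */
int copy_scanline(uint8_t *dst, int dst_width, int dst_height,
                  size_t dst_stride, size_t bytes_per_pixel,
                  int x, int y, const uint8_t *src, int count)
{
    assert(dst != NULL && src != NULL);

    /* Row entirely outside the surface, or nothing to copy. */
    if (y < 0 || y >= dst_height || count <= 0)
        return 0;

    /* Clip the left edge once for the whole run. */
    int skip = 0;
    if (x < 0) {
        skip = -x; /* source pixels hanging off the left edge */
        x = 0;
    }
    if (skip >= count || x >= dst_width)
        return 0;

    /* Clip the right edge once for the whole run. */
    int visible = count - skip;
    if (visible > dst_width - x)
        visible = dst_width - x;

    /* One address calculation and one block copy for every visible pixel. */
    memcpy(dst + (size_t)y * dst_stride + (size_t)x * bytes_per_pixel,
           src + (size_t)skip * bytes_per_pixel,
           (size_t)visible * bytes_per_pixel);
    return visible;
}
```

The checks cost about the same as for one pixel, but they are now shared by every pixel in the run.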

In some cases, you can even offload work to the GPU -- if both source and destination scanline are on the GPU, you can say "yo GPU, you handle it", and the GPU is even more specialized for doing that kind of task.

The very first bit of optimization -- only having to do checks once per scanline instead of once per pixel -- can easily give you a 2x to ~10x speedup. The second -- more efficient blitting -- another 4x to ~20x faster. Doing everything on the GPU can be ~2x to 100x faster.

The final thing is the overhead of actually calling the function. Usually this is minor, but when calling SetPixel 1 million times (a 1000 x 1000 image, or a modest-sized screen) it adds up.

For an HD display with 2 million pixels, 60 times per second is 120 million pixels manipulated per second. A single-threaded program on a 3 GHz machine only has room to run ~25 instructions per pixel (at roughly one instruction per clock cycle) if you want to keep up with the screen, and that assumes nothing else happens (which is unlikely). On a 4k monitor you are down to 6 instructions per pixel.
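The budget arithmetic can be checked with a few lines of C. The function name `cycles_per_pixel` is made up for this example, and it treats one clock cycle as one instruction, which is a simplification.

```c
#include <assert.h>

/* Clock cycles available per pixel if every pixel of a
   width x height display is touched `fps` times per second.
   Assumes a single thread and nothing else competing for the CPU. */
double cycles_per_pixel(double clock_hz, double width,
                        double height, double fps)
{
    assert(clock_hz > 0 && width > 0 && height > 0 && fps > 0);
    return clock_hz / (width * height * fps);
}
```

With a 3 GHz clock, `cycles_per_pixel(3e9, 1920, 1080, 60)` comes out at about 24, and `cycles_per_pixel(3e9, 3840, 2160, 60)` at about 6.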

With that many pixels being played with, shaving off every instruction you can makes a big difference.


The multipliers above are pulled out of nowhere. However, I have converted per-pixel operations to per-scanline operations and seen impressive speedups, and likewise when moving work from the CPU to the GPU and when using SIMD.

Upvotes: 4

paddy

Reputation: 63451

Repeated calls to a function like SetPixelV are slow because it must translate a co-ordinate into a memory offset each time, and is also potentially doing some colour translation on the fly.

A simple "set pixel" function might look like this (without bounds-tests, colour translation or anything fancy):

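// Byte offset of pixel (x, y) within the buffer.
size_t offset = y * bytes_per_scanline + x * bytes_per_pixel;
// Copy that one pixel's bytes; source is assumed to have the same layout as target.
for (size_t i = offset; i < offset + bytes_per_pixel; i++)
    target[i] = source[i];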

Bitmaps, on the other hand, are generally drawn via a process known as blitting. This is essentially a direct copy from one memory location to another. To achieve this in Windows, you create a device context for your bitmap that is compatible with the target context. That ensures the memory can be copied without translation. It may also provide for hardware-accelerated copies which are even faster.

A simple "copy" blit might look like this:

size_t nbytes = bytes_per_scanline * height;
for(size_t i = 0; i < nbytes; i++)
    target[i] = source[i];

This involves no co-ordinate lookups, and it is very efficient in terms of memory cache accesses because both buffers are walked sequentially. The loop above is only an illustration; there are much faster ways to copy chunks of memory.
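One of those faster ways is to hand the copy to `memcpy`. The sketch below uses a hypothetical `blit_copy` function and also allows each buffer to have its own scanline stride, since bitmap rows are often padded. It assumes both buffers use the same pixel format and do not overlap, and it is only an outline, not what BitBlt does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies `height` rows of `row_bytes` bytes each from source to target.
   Each buffer may have its own stride (bytes between row starts). */
void blit_copy(uint8_t *target, size_t target_stride,
               const uint8_t *source, size_t source_stride,
               size_t row_bytes, size_t height)
{
    assert(target != NULL && source != NULL);
    assert(row_bytes <= target_stride && row_bytes <= source_stride);

    /* No padding anywhere: the whole image is one contiguous block. */
    if (target_stride == row_bytes && source_stride == row_bytes) {
        memcpy(target, source, row_bytes * height);
        return;
    }

    /* Otherwise copy one scanline at a time, skipping the padding. */
    for (size_t row = 0; row < height; row++)
        memcpy(target + row * target_stride,
               source + row * source_stride,
               row_bytes);
}
```

Either way, the per-pixel work disappears: the only arithmetic left is one offset calculation per row, or none at all when the buffers are contiguous.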

Upvotes: 1
