Reputation: 105

How to speed up the run time of of a time-consuming calculation

I'm trying to write a function to convert an image into characters and colors for the windows console. At the moment the calculation takes about 13 seconds with a 700x700 pixel image but that time is undesirable especially when I plan on making the function more complex in order to account for character shapes.

What are some methods to speed up heavy calculations and loops like below in C++? I've been recommended multiple threads, SIMD, and inline assembly but how would I go about improving a function like below with those methods?

This is the current code I'm using.

unsigned char characterValues[256] = { 0 };

// This operation can be done ahead of time when the program is started up
{
    ResourceInputStream in = ResourceInputStream();
    // This image is the font for the console. The background color is black while the foreground color is white
    in.open(BMP_FONT, 2); // 2 is for RT_BITMAP, BMP_FONT is a resource
    if (in.isOpen()) {
        auto bmp = readBitmap(&in, true);
        in.close();
        for (int x = 0; x < bmp->size.x; x++) {
            for (int y = 0; y < bmp->size.y; y++) {
                int charIndex = (x / 8) + (y / 12) * 16;
                if (bmp->pixels[x][y].r == 255)
                    characterValues[charIndex]++;
            }
        }
    }
}
// This operation is for asciifying the image
{
    FileInputStream in = FileInputStream();
    in.open(R"(image-path.bmp)");
    if (in.isOpen()) {
        auto bmp = readBitmap(&in, false);
        in.close();

        auto image = /* make default image here */
        Point2I imageSize = (Point2I)GMath::ceil((Point2F)bmp->size / Point2F(8.0f, 12.0f));
        int totalImageSize = imageSize.x * imageSize.y;
        image->resize(imageSize);
        auto palette = /* get palette of 16 colors here */

        // Iterate through each (character area)
        for (int imgx = 0; imgx < imageSize.x; imgx++) {
            for (int imgy = 0; imgy < imageSize.y; imgy++) {

                // Read image color value
                int r = 0, g = 0, b = 0;
                int totalRead = 0;
                // Read each pixel inside the bounds of a single character
                for (int px = 0; px < 8; px++) {
                    for (int py = 0; py < 12; py++) {
                        Point2I p = Point2I(imgx * 8 + px, imgy * 12 + py);
                        if (p < bmp->size) {
                            r += bmp->pixels[p.x][p.y].r;
                            g += bmp->pixels[p.x][p.y].g;
                            b += bmp->pixels[p.x][p.y].b;
                            totalRead++;
                        }
                    }
                }
                Color imageValue = Color(r / totalRead, g / totalRead, b / totalRead);

                // A combo of a character and foreground/background color
                Pixel closestPixel = Pixel();
                float closestScore = std::numeric_limits<float>().max();
                for (int col = 1; col < 255; col++) {
                    unsigned char f = getFColor(col);
                    unsigned char b = getBColor(col);
                    for (int ch = 1; ch < 255; ch++) {
                        // Calculate values
                        Color value = Color(
                            (palette[f].r * characterValues[ch] + palette[b].r * (TOTAL_CHARACTER_VALUE - characterValues[ch])) / TOTAL_CHARACTER_VALUE,
                            (palette[f].g * characterValues[ch] + palette[b].g * (TOTAL_CHARACTER_VALUE - characterValues[ch])) / TOTAL_CHARACTER_VALUE,
                            (palette[f].b * characterValues[ch] + palette[b].b * (TOTAL_CHARACTER_VALUE - characterValues[ch])) / TOTAL_CHARACTER_VALUE
                        );
                        // Add up score here
                        float score =
                            (float)((int)value.r - (int)imageValue.r) * (float)((int)value.r - (int)imageValue.r) +
                            (float)((int)value.g - (int)imageValue.g) * (float)((int)value.g - (int)imageValue.g) +
                            (float)((int)value.b - (int)imageValue.b) * (float)((int)value.b - (int)imageValue.b);
                        if (score < closestScore) {
                            closestPixel = Pixel((unsigned char)ch, (unsigned char)col);
                            closestScore = score;
                        }
                    }
                }
                // Set the character/color combo here
            }
        }
    }
}

Upvotes: 0

Answers (2)

Jason

Reputation: 1099

you have an x loop and a nested y loop, are you sure that's the byte order in memory? It might be but you can always try the other way around if it helps.

// could be faster, depending on data structure
 for (int y = 0; y < bmp->size.y; y++) {
     for (int x = 0; x < bmp->size.x; x++) {

but since the bmp indexes go [x][y], it looks like it's column-first data, which is odd.

There are also costly divisions in your inner loop. You can do any loop-invariant calculations outside each loop:

 for (int x = 0; x < bmp->size.x; x++) { 
     int charIndex_x = (x / 8);
     for (int y = 0; y < bmp->size.y; y++) {
          int charIndex = charIndex_x + (y / 12) * 16;
          // other stuff

It could still be improved further, but you just avoided doing almost 65536 division operations on doing this for a 256x256 bitmap.

additionally, there is a 2D array dereference in your inner loop, those are costly operations. You can record a pointer to the start of the column then increment the pointer:

 for (int x = 0; x < bmp->size.x; x++) {
     int charIndex_x = (x / 8);
     auto current_pixel = &bmp->pixels[x][0];
     for (int y = 0; y < bmp->size.y; y++) {
          int charIndex = charIndex_x + (y / 12) * 16;
          if (*current_pixel.r == 255)
                characterValues[charIndex]++;
          ++current_pixel;

And increment it in the inner loop. You could in fact move the current_pixel setup, right outside the x loop as well, but I've had situation where that was slower since it has to keep more variables alive inside the memory. Generally you want local variables in the inner loop if possible. Moving calculations outside loops speeds things up but uses more CPU memory, meaning it could be slower due to juggling more stored values.

the last thing to note is that each time through your inner loop you're checking whether the y value is less than "bmp->size.y" this includes looking up bmp then referencing size, then referencing size.y, which is three operations, happening 65536 time for a 256x256 bitmap. You can record the y size in a local variable before needing it:

 for (int x = 0; x < bmp->size.x; x++) {
     int charIndex_x = (x / 8);
     auto current_pixel = &bmp->pixels[x][0];
     int bmp_size_y = bmp->size.y;
     for (int y = 0; y < bmp_size.y; y++) {
          int charIndex = charIndex_x + (y / 12) * 16;
          if (*current_pixel.r == 255)
                characterValues[charIndex]++;
          ++current_pixel;

You could move it outside the x loop altogether, to avoid setting the value 256 times, as bmp->size.y never changes, but the saving for that is very small, and it might even slow things down as it would use up and extra register, which can mean the program needs to juggle more things in memory.

CPU memory is like virtual memory on your Windows PC. If too much is used, things slow down because it's paging stuff to disk, but having more stuff in memory can also speed things up because it doesn't need to constantly look things up from disk. Coding is similar in that local variables can be stored just in the CPU, avoiding having to look them up from memory, but too many local variables can overload the CPU meaning it needs to keep juggling them like virtual memory does. So make local variables as "local" as possible to avoid overusing them. You should always profile any changes you make to see if they really helped.

~~~

As for your other loop, you have many complex repeated calculations inside the inner loop:

bmp->pixels[p.x][p.y]

is calculated three times, and this includes a pointer dereference, two member dereferces (p.x and p.y) then a 2D array dereference (which at best is a multiply and an add, then a pointer dereference). That's at least 6 atomic computations right there, just to get the reference to that pixel each time.

Instead you can go:

auto current_pixel = bmp->pixels[p.x][p.y];

Better, you're calculating a Point2I then checking whether the x and y values of that are inside the bmp size. You don't need the Point2I at all, just calculate the x and y sizes and compare to the to the bmp x and y size individually.

Calculate the x bounds in the outer loop, do the if-check for x there, and you avoid hitting the inner loop at all if x is out of bounds. Combine that with avoiding having to create or index structures inside the inner loop and you get:

           for (int px = 0; px < 8; px++) {
                int p_x = imgx * 8 + px;
                if(p_x < bmp->size.x) {
                    for (int py = 0; py < 12; py++) {
                        int p_y = imgy * 12 + py;
                        if (p_y < bmp->size.y) {
                            auto pixel = bmp->pixels[p_x][p_y];
                            r += pixel.r;
                            g += pixel.g;
                            b += pixel.b;
                            totalRead++;
                        }
                    }
                }
            }

Upvotes: 2

user207421

Reputation: 310957

for (int x = 0; x < bmp->size.x; x++) {
    for (int y = 0; y < bmp->size.y; y++) {

Start both these loops at the top value, i.e. bmp->size.x-1 and bmp->size.y-1 respectively, and run them down to zero. That way you're only evaluating the boundary conditions once per loop instead of each iteration.

int charIndex = (x / 8) + (y / 12) * 16;

Don't compute this unless you're going to use it, i.e. put it into the following if block.

Upvotes: 1

How to speed up the run time of of a time-consuming calculation

Answers (2)

Related Questions