Syntactic Fructose

Reputation: 20084

Speeding up large amounts of array-related computation, Visual Studio

I'm wondering what the best approach would be for speeding up heavy amounts of array computation. Let's say I have this scenario:

int template_t[] = {1, 2, 3, 4, 5, 6, ...., 125};
int image[3200][5600];
int template_image[3200][5600];

for(int i = 0; i < 3200; i++) {
    for(int j = 0; j < 5600; j++) {

        // iterate over template to find template value per pixel
        for(int h = 0; h < template_length; h++)
            template_image[i][j] += template_t[h] * image[i][j];

    }
}

Of course my situation is much more complex, but the same idea applies. I have some large array representing the pixels within an image, and I need to apply some template array to each pixel to calculate a value to be placed in the template image.

I've thought about a couple of ways to speed this up, but I'm not sure which would give me the most bang for my buck. Thanks for any suggestions!

Upvotes: 0

Views: 113

Answers (2)

Peter Cordes

Reputation: 364408

First of all, use _t names for types, not arrays. Let's call the array template_multipliers[].

If template_multipliers is const, and the variables are unsigned, the compiler can sum it at compile-time and optimize away the inner loop entirely.
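For illustration, that means the whole inner h loop folds down to one multiply-add per pixel. A minimal sketch, using the arrays and DIM1/DIM2 constants declared in the next code block (SUM_OF_TEMPLATE is a hypothetical name for the compile-time sum; it's 215 for the template_multipliers[] used below):

// sketch: SUM_OF_TEMPLATE stands for whatever the const template sums to
// at compile time (215 for the template_multipliers[] used below)
static const unsigned int SUM_OF_TEMPLATE = 215;

void loop_collapsed(void) {
  for (int i = 0; i < DIM1; i++)
    for (int j = 0; j < DIM2; j++)
      template_image[i][j] += SUM_OF_TEMPLATE * image[i][j];
}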

For gcc, we also get better code by hoisting the sum of template_multipliers out of the loop ourselves. With the sum hoisted, it manages to do the summing at compile time even with int, instead of only with unsigned int.

Signed overflow is undefined behaviour, and this may be why gcc is sometimes bad at reasoning about what will actually happen on the target machine it's optimizing for. (e.g. on x86, wraparound isn't something you need to avoid: the final result of a sequence of additions doesn't depend on the order of operations, even if some orders produce signed overflow in a temporary result and some don't. gcc doesn't always take advantage of addition's associativity in the signed case.)

This is purely a gcc limitation. Your code must avoid signed overflow in the source-level order of operations, but if the compiler knows that you'll get the same results from doing something else that's faster, it can and should.


// aligning the arrays makes gcc's asm output *MUCH* shorter: no fully-unrolled prologue/epilogue for handling unaligned elements
#define DIM1 320
#define DIM2 1000
alignas(32) unsigned int image[DIM1][DIM2];
alignas(32) unsigned int template_image[DIM1][DIM2];

// with const, gcc can sum them at compile time.
const
static unsigned int template_multipliers[] = {1, 2, 3, 4, 5, 6, 7, 8, 8, 10, 11, 12, 13,   125};
const static int template_length = sizeof(template_multipliers) / sizeof(template_multipliers[0]);


void loop_hoisted(void) {
  for(int i = 0; i < DIM1; i++) {
    for(int j = 0; j < DIM2; j++) {
        // iterate over template to find template value per pixel
        unsigned int tmp = 0;
        for(int h = 0; h < template_length; h++)
            tmp += template_multipliers[h];
        template_image[i][j] += tmp * image[i][j];

    }
  }
}

gcc 5.3 with -O3 -fverbose-asm -march=haswell auto-vectorizes this with an inner loop of:

# gcc inner loop: ymm1 = set1(215) = sum of template_multipliers
.L2:
    vpmulld ymm0, ymm1, YMMWORD PTR [rcx+rax] # vect__16.10, tmp115, MEM[base: vectp_image.8_4, index: ivtmp.18_90, offset: 0B]
    vpaddd  ymm0, ymm0, YMMWORD PTR [rdx+rax]   # vect__17.12, vect__16.10, MEM[base: vectp_template_image.5_84, index: ivtmp.18_90, offset: 0B]
    vmovdqa YMMWORD PTR [rdx+rax], ymm0       # MEM[base: vectp_template_image.5_84, index: ivtmp.18_90, offset: 0B], vect__17.12
    add     rax, 32   # ivtmp.18,
    cmp     rax, 4000 # ivtmp.18,
    jne     .L2       #,

This is 9 fused-domain uops in the inner loop on Intel Haswell, since pmulld is 2 uops on Haswell and later (and can't micro-fuse even with a one-register addressing mode). This means the loop can only run at one iteration per 3 clocks. gcc could have saved 2 uops (so it would run at one iteration per 2 clocks) by using a pointer-increment for the destination, and a dst + src-dst 2-register addressing mode for the src (since it can't micro-fuse anyway).

See the godbolt Compiler Explorer link for the full source of a less-modified version of the OP's code which doesn't hoist the summing of template_multipliers. It makes weird asm:

    unsigned int tmp = template_image[i][j];
    for(int h = 0; h < template_length; h++)
        tmp += template_multipliers[h] * image[i][j];
    template_image[i][j] = tmp;

.L8:  # ymm4 is a vector of set1(198)
    vmovdqa ymm2, YMMWORD PTR [rcx+rax]       # vect__22.42, MEM[base: vectp_image.41_73, index: ivtmp.56_108, offset: 0B]
    vpaddd  ymm1, ymm2, YMMWORD PTR [rdx+rax]   # vect__1.47, vect__22.42, MEM[base: vectp_template_image.38_94, index: ivtmp.56_108, offset: 0B]
    vpmulld ymm0, ymm2, ymm4  # vect__114.43, vect__22.42, tmp110
    vpslld  ymm3, ymm2, 3       # vect__72.45, vect__22.42,
    vpaddd  ymm0, ymm1, ymm0    # vect__2.48, vect__1.47, vect__114.43
    vpaddd  ymm0, ymm0, ymm3    # vect__29.49, vect__2.48, vect__72.45
    vpaddd  ymm0, ymm0, ymm3    # vect_tmp_115.50, vect__29.49, vect__72.45
    vmovdqa YMMWORD PTR [rdx+rax], ymm0       # MEM[base: vectp_template_image.38_94, index: ivtmp.56_108, offset: 0B], vect_tmp_115.50
    add     rax, 32   # ivtmp.56,
    cmp     rax, 4000 # ivtmp.56,
    jne     .L8       #,

It's doing some of the summing of template_multipliers every time through the loop: the multiply by 198, plus the source element itself, plus two copies of the source shifted left by 3 add up to the same src * 215 per element. The number of adds in the loop changes depending on the values in the array (not just the number of values).


Most of these optimizations should be applicable to MSVC. Keeping template_multipliers const is still the important part, although whole-program link-time optimization might let it do the sum even if template_multipliers is non-const.

Upvotes: 1

Paul Ogilvie

Reputation: 25286

One simple optimization, though the compiler should already do that for you, would be:

    int p = template_image[i][j], p2 = image[i][j];
    // iterate over template to find template value per pixel
    for(int h = 0; h < template_length; h++)
        p += template_t[h] * p2;

    template_image[i][j] = p;

Looking further at this and the definition of your template as 1, 2, 3, ..., 125, p2 is effectively multiplied by the sum 1+2+3+...+125, which is a constant (let us call it CT), and so:

for (h..
    template_image[i][j] += template_t[h] * image[i][j];

is equivalent to

template_image[i][j] += CT * image[i][j];

so the algorithm becomes:

#define CT (1+2+3+4+5+6+7+...+125) // must still be completed
int image[3200][5600];
int template_image[3200][5600];

for(int i = 0; i < 3200; i++) {
    for(int j = 0; j < 5600; j++) {
        template_image[i][j] += CT * image[i][j];
    }
}
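The question elides the middle of the template, so this is an assumption, but if the template really is the consecutive integers 1 through 125, CT has a closed form (n*(n+1)/2 with n = 125):

/* assumes template_t is literally 1, 2, 3, ..., 125 */
#define CT (125 * 126 / 2)   /* = 7875 */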

This may be parallelized over j.
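A minimal OpenMP sketch of that parallelization (assuming OpenMP is available: /openmp in Visual Studio, -fopenmp in gcc). Here the pragma splits the outer i loop so each thread gets whole rows; putting it on the j loop instead works the same way:

#define DIM1 3200
#define DIM2 5600
#define CT 7875   /* placeholder; assumes the template really sums to 1+2+...+125 */

int image[DIM1][DIM2];
int template_image[DIM1][DIM2];

void apply_template_parallel(void)
{
    #pragma omp parallel for              // one chunk of rows per thread
    for (int i = 0; i < DIM1; i++) {
        for (int j = 0; j < DIM2; j++)    // inner loop can still auto-vectorize
            template_image[i][j] += CT * image[i][j];
    }
}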

Upvotes: 0
