ZHOU
ZHOU

Reputation: 1191

How to improve inline function efficiency?

I profiled my code and found that one inline function takes about 8% of the samples. The function is to convert matrix subscripts to indices. It is quite like the matlab function sub2ind.

inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

I guess the compiler does not perform inline expansion, but I don't know how to check that out.

Is there any way to improve this? Or explicitly let the compiler perform inline expansion?

Upvotes: 5

Views: 281

Answers (2)

vitaut
vitaut

Reputation: 55685

As @user3528438 mentioned, you can look at the assembly output. Consider the following example:

inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

int main() {
    volatile int n[] = {1, 2, 3};
    return sub2ind(n[0], n[1], n[2]);
}

Compiling it without optimization (g++ -S test.cc) results in the following code with sub2ind not inlined:

main:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $32, %rsp
    movl    $1, -16(%rbp)
    movl    $2, -12(%rbp)
    movl    $3, -8(%rbp)
    movq    -16(%rbp), %rax
    movq    %rax, -32(%rbp)
    movl    -8(%rbp), %eax
    movl    %eax, -24(%rbp)
    movl    -24(%rbp), %edx
    movl    -28(%rbp), %ecx
    movl    -32(%rbp), %eax
    movl    %ecx, %esi
    movl    %eax, %edi
    call    _Z7sub2indiii ; call to sub2ind
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

while compiling with optimization (g++ -S -O3 test.cc) results in sub2ind being inlined and mostly optimized away:

main:
.LFB1:
    .cfi_startproc
    movl    $1, -24(%rsp)
    movl    $2, -20(%rsp)
    movq    -24(%rsp), %rax
    movl    $3, -16(%rsp)
    movq    %rax, -40(%rsp)
    movl    $3, -32(%rsp)
    movl    -32(%rsp), %eax
    movl    -36(%rsp), %edx
    movl    -40(%rsp), %ecx
    imull   %ecx, %eax
    addl    %edx, %eax
    ret
    .cfi_endproc

So if you are convinced that your function is not inlined, first make sure that you enable optimization in the compiler options.

Upvotes: 1

user1084944
user1084944

Reputation:

Did you remember to compile with optimizations? Some compilers have an attribute to force inlining, even when the compiler doesn't want to: see this question.

But it probably has already; you can try having your compiler output the assembly code and try to check for sure that way.

It is not implausible that index calculations can be a significant fraction of your time -- e.g. if your algorithm is reading from a matrix, a little bit of calculation, then writing back, then index calculations really are a significant fraction of your compute time.

Or, you've written your code in a way that the compiler can't prove that width remains constant throughout your loops*, and so it has to reread it from memory every time, just to be sure. Try copying width to a local variable and use that in your inner loops.

Now, you've said that this takes 8% of your time -- that means it is unlikely that you can possibly get anything more than an 8% improvement to your runtime, and probably much less. If that's really worth it, then the thing to do is probably to fundamentally change how you iterate through the array.

e.g.

  • if you tend to access the matrix in a linear fashion, you could write some sort of two-dimensional iterator class that you can advance up, down, left, or right, and it will use additions everywhere instead of multiplication
  • same thing, but writing an "index" class that just holds the numbers rather than pretending to be a pointer
  • if width is a compile-time constant, you could make it explicitly so, e.g. as a template parameter, and your compiler might be able to do more clever things with the multiplication

*: You could have done something silly, like put the data structure for your matrix in the very memory where you're storing the matrix entries! So when you update the matrix, you might change the width. The compiler has to guard against these loopholes, so it can't do optimizations it 'obviously should' be able to do. And sometimes, the sort of thing that a loophole in one context can well be the programmer's obvious intent in another context. Generally speaking, these sorts of loop holes tend to be all over the place, and the compiler is better at finding these loopholes than humans are at noticing them.

Upvotes: 5

Related Questions