Reputation: 1191
I profiled my code and found that one inline function accounts for about 8% of the samples. The function converts matrix subscripts to a linear index, much like the MATLAB function sub2ind:
inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}
I suspect the compiler is not performing inline expansion, but I don't know how to verify that.
Is there any way to improve this, or to explicitly make the compiler inline the function?
Upvotes: 5
Views: 281
Reputation: 55685
As @user3528438 mentioned, you can look at the assembly output. Consider the following example:
inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}

int main() {
    volatile int n[] = {1, 2, 3};
    return sub2ind(n[0], n[1], n[2]);
}
Compiling it without optimization (g++ -S test.cc) results in the following code, with sub2ind not inlined:
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movl $1, -16(%rbp)
movl $2, -12(%rbp)
movl $3, -8(%rbp)
movq -16(%rbp), %rax
movq %rax, -32(%rbp)
movl -8(%rbp), %eax
movl %eax, -24(%rbp)
movl -24(%rbp), %edx
movl -28(%rbp), %ecx
movl -32(%rbp), %eax
movl %ecx, %esi
movl %eax, %edi
call _Z7sub2indiii # call to sub2ind
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
while compiling with optimization (g++ -S -O3 test.cc) results in sub2ind being inlined and mostly optimized away:
main:
.LFB1:
.cfi_startproc
movl $1, -24(%rsp)
movl $2, -20(%rsp)
movq -24(%rsp), %rax
movl $3, -16(%rsp)
movq %rax, -40(%rsp)
movl $3, -32(%rsp)
movl -32(%rsp), %eax
movl -36(%rsp), %edx
movl -40(%rsp), %ecx
imull %ecx, %eax
addl %edx, %eax
ret
.cfi_endproc
So if you suspect that your function is not being inlined, first make sure that optimization is enabled in your compiler options.
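One way to run this check from the command line, using the example above (a sketch assuming g++ is on your PATH; the file names are arbitrary):

```shell
# Recreate the example and compile it to assembly at -O0 and -O3.
cat > test.cc <<'EOF'
inline int sub2ind(const int sub_height, const int sub_width, const int width) {
    return sub_height * width + sub_width;
}
int main() {
    volatile int n[] = {1, 2, 3};
    return sub2ind(n[0], n[1], n[2]);
}
EOF
g++ -S -O0 -o test_O0.s test.cc
g++ -S -O3 -o test_O3.s test.cc

# A surviving call to the mangled name means sub2ind was NOT inlined.
grep -q 'call.*_Z7sub2ind' test_O0.s && echo "call present at -O0"
grep -q 'call.*_Z7sub2ind' test_O3.s || echo "no call at -O3"
```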
Upvotes: 1
Reputation:
Did you remember to compile with optimizations? Some compilers have an attribute to force inlining even when the compiler doesn't want to: see this question.
But it has probably been inlined already; you can have your compiler output the assembly and check for sure that way.
It is not implausible that index calculations are a significant fraction of your runtime: if, for example, your algorithm reads from a matrix, does a little computation, and writes back, then the index arithmetic really can account for a meaningful share of your compute time.
Or, you may have written your code in a way that prevents the compiler from proving that width stays constant throughout your loops*, so it has to reread it from memory every time, just to be sure. Try copying width into a local variable and using that in your inner loops.
Now, you've said that this takes 8% of your time -- which means the most you could possibly gain is an 8% improvement in runtime, and in practice much less. If that's really worth it, then the thing to do is probably to fundamentally change how you iterate through the array.
E.g., if width is a compile-time constant, you could make it explicitly so, e.g. as a template parameter, and your compiler might be able to do more clever things with the multiplication.

*: You could have done something silly, like putting the data structure for your matrix in the very memory where you're storing the matrix entries! So when you update the matrix, you might change the width. The compiler has to guard against these loopholes, so it can't do optimizations it 'obviously should' be able to do. And sometimes, what is a loophole in one context can well be the programmer's obvious intent in another context. Generally speaking, these sorts of loopholes tend to be all over the place, and the compiler is better at finding them than humans are at noticing them.
Upvotes: 5