Clang sizeof("literal") optimization

Question

Experiencing with C++, I tried to understand the performance difference between sizeof and strlen for string literal.

Here my small benchmark code:

#include 
#include 

#define LOOP_COUNT 1000000000

unsigned long long rdtscl(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

int main()
{
    unsigned long long before = rdtscl();
    size_t ret;
    for (int i = 0; i < LOOP_COUNT; i++)
        ret = strlen("abcd");
    unsigned long long after = rdtscl();
    std::cout << "Strlen " << (after - before) << " ret=" << ret <<     std::endl;

    before = rdtscl();
    for (int i = 0; i < LOOP_COUNT; i++)
        ret = sizeof("abcd");
    after = rdtscl();
    std::cout << "Sizeof " << (after - before) << " ret=" << ret << std::endl;
}

Compiling with clang++, I get the following result:

clang++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp
./sizeof_vs_strlen

Strlen 36 ret=4
Sizeof 62092396 ret=5

With g++:

g++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp 
./sizeof_vs_strlen

Strlen 30 ret=4
Sizeof 30 ret=5

I strongly suspect that g++ does optimize the loop with sizeof and clang++ don't. Is this result a known issue?

EDIT:

The assembly generated by clang++ for the loop with sizeof:

rdtsc  
mov    %edx,%r14d
shl    $0x20,%r14
mov    $0x3b9aca01,%ecx
xchg   %ax,%ax
add    $0xffffffed,%ecx // 0x400ad0
jne    0x400ad0 
mov    %eax,%eax
or     %rax,%r14
rdtsc

And the one by g++:

rdtsc  
mov    %edx,%esi
mov    %eax,%ecx
rdtsc

I don't understand why clang++ do the {add, jne} loop, it seems useless. Is it a bug?

For information:

g++ (GCC) 5.1.0
clang version 3.6.2 (tags/RELEASE_362/final)

EDIT2: It it likely to be a bug in clang. I opened a bug report.

Mats Petersson · Accepted Answer

I'd call that a bug in clang.

It's actually optimising the sizeof itself, just not the loop.

To make the code much clearer, I change the std::cout to printf, and then you get the following LLVM-IR code for main:

; Function Attrs: nounwind uwtable
define i32 @main() #0 {
entry:
  %0 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
  %asmresult1.i = extractvalue { i32, i32 } %0, 1
  %conv2.i = zext i32 %asmresult1.i to i64
  %shl.i = shl nuw i64 %conv2.i, 32
  %asmresult.i = extractvalue { i32, i32 } %0, 0
  %conv.i = zext i32 %asmresult.i to i64
  %or.i = or i64 %shl.i, %conv.i
  %1 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
  %asmresult.i.25 = extractvalue { i32, i32 } %1, 0
  %asmresult1.i.26 = extractvalue { i32, i32 } %1, 1
  %conv.i.27 = zext i32 %asmresult.i.25 to i64
  %conv2.i.28 = zext i32 %asmresult1.i.26 to i64
  %shl.i.29 = shl nuw i64 %conv2.i.28, 32
  %or.i.30 = or i64 %shl.i.29, %conv.i.27
  %sub = sub i64 %or.i.30, %or.i
  %call2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* @.str, i64 0, i64 0), i64 %sub, i64 4)
  %2 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
  %asmresult1.i.32 = extractvalue { i32, i32 } %2, 1
  %conv2.i.34 = zext i32 %asmresult1.i.32 to i64
  %shl.i.35 = shl nuw i64 %conv2.i.34, 32
  br label %for.cond.5

for.cond.5:                                       ; preds = %for.cond.5, %entry
  %i4.0 = phi i32 [ 0, %entry ], [ %inc10.18, %for.cond.5 ]
  %inc10.18 = add nsw i32 %i4.0, 19
  %exitcond.18 = icmp eq i32 %inc10.18, 1000000001
  br i1 %exitcond.18, label %for.cond.cleanup.7, label %for.cond.5

for.cond.cleanup.7:                               ; preds = %for.cond.5
  %asmresult.i.31 = extractvalue { i32, i32 } %2, 0
  %conv.i.33 = zext i32 %asmresult.i.31 to i64
  %or.i.36 = or i64 %shl.i.35, %conv.i.33
  %3 = tail call { i32, i32 } asm sideeffect "rdtsc", "={ax},={dx},~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !1
  %asmresult.i.37 = extractvalue { i32, i32 } %3, 0
  %asmresult1.i.38 = extractvalue { i32, i32 } %3, 1
  %conv.i.39 = zext i32 %asmresult.i.37 to i64
  %conv2.i.40 = zext i32 %asmresult1.i.38 to i64
  %shl.i.41 = shl nuw i64 %conv2.i.40, 32
  %or.i.42 = or i64 %shl.i.41, %conv.i.39
  %sub13 = sub i64 %or.i.42, %or.i.36
  %call14 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([21 x i8], [21 x i8]* @.str, i64 0, i64 0), i64 %sub13, i64 5)
  ret i32 0
}

As you can see, the call to printf is using the constant 5 from sizeof, and the for.cond.5: starts the empty loop: a "phi" node (which selects the "new" value of i based on where we came from - before loop -> 0, inside loop -> %inc10.18) an increment a conditional branch that jumps back if %inc10.18 is not 100000001.

I don't know enough about clang and LLVM to explain why that empty loop isn't optimised away. But it's certainly not the sizeof that is taking time, as there is no sizeof inside the loop.

It's worth noting that sizeof is ALWAYS a constant at compile-time, it NEVER "takes time" beyond loading a constant value into a register.

Clang sizeof("literal") optimization

Answers (2)

Related Questions

Clang sizeof(&quot;literal&quot;) optimization

Answers (2)

Related Questions

Clang sizeof("literal") optimization