Adrien Givry

Reputation: 965

Can we beat the compiler?

Today I was writing a Pool Allocator when I came up with a question:
Is it possible to beat the compiler?

By beating the compiler I mean writing code that performs memory allocation in fewer clock cycles than the simplest version (allocating variables on the stack, one by one).

So I came up with a very simple BytePool:

template <size_t PoolSize>
class BytePool
{
public:
    template <typename T>
    T& At(size_t p_index)
    {
        return (T&)m_data[p_index * sizeof(T)];
    }

private:
    std::byte m_data[PoolSize];
};

This simple code allows me to allocate a byte array on the stack once and access it as if it were an array of T.

In order to manipulate this array, I made a macro:

#define is(type, slot) bytePool.At<type>(slot)

This macro allows me to write #define a is (int, 0x0000), so that a acts like a pseudo-variable pointing to bytePool[sizeof(int) * 0x0000].

Using this macro, I made a simple piece of code that performs basic operations with some numbers (Some defined at compile time, and some at runtime, such as b and c):

BytePool<sizeof(int) * 6> bytePool;

#define is(type, slot) bytePool.At<type>(slot)

#define a is (int, 0x0000)
#define b is (int, 0x0001)
#define c is (int, 0x0002)
#define d is (int, 0x0003)
#define e is (int, 0x0004)
#define f is (int, 0x0005)

a = 0;
b = (int)time(nullptr);
c = (int)__rdtsc();
d = 2 * b;
e = c - 3;
f = 18 ^ 2;

a = ~(b * c) * d + e / f;

#undef a
#undef b
#undef c
#undef d
#undef e
#undef f

Fun! This code reads as if I had manually assigned memory slots to my variables.

The equivalent without using the BytePool would be:

int a;
int b;
int c;
int d;
int e;
int f;

a = 0;
b = (int)time(nullptr);
c = (int)__rdtsc();
d = 2 * b;
e = c - 3;
f = 18 ^ 2;

a = ~(b * c) * d + e / f;

The question that I asked myself at this point was:
Which one of these approaches is the better one?

Naturally I was sure that allocating memory once was faster. So I guessed my BytePool approach was faster.

Now, let's listen to the compiler. I wrote some code to benchmark:

#include <iostream>
#include <intrin.h>
#include <ctime>
#include <cstddef> // std::byte

template <size_t PoolSize>
class BytePool
{
public:
    template <typename T>
    T& At(size_t p_index)
    {
        return (T&)m_data[p_index * sizeof(T)];
    }

private:
    std::byte m_data[PoolSize];
};

void Stack()
{
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;
}

void Pool()
{
    BytePool<sizeof(int) * 6> bytePool;

    #define is(type, slot) bytePool.At<type>(slot)

    #define a is (int, 0x0000)
    #define b is (int, 0x0001)
    #define c is (int, 0x0002)
    #define d is (int, 0x0003)
    #define e is (int, 0x0004)
    #define f is (int, 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
}

void FastPool()
{
    int fastBytePool[6];

    #define a   *(fastBytePool)
    #define b   *(fastBytePool + 0x0001)
    #define c   *(fastBytePool + 0x0002)
    #define d   *(fastBytePool + 0x0003)
    #define e   *(fastBytePool + 0x0004)
    #define f   *(fastBytePool + 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
}

void FastHeapPool()
{
    int* fastBytePool = new int[6];

    #define a   *(fastBytePool)
    #define b   *(fastBytePool + 0x0001)
    #define c   *(fastBytePool + 0x0002)
    #define d   *(fastBytePool + 0x0003)
    #define e   *(fastBytePool + 0x0004)
    #define f   *(fastBytePool + 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f

    delete[] fastBytePool;
}

size_t Benchmark(void (*p_function)(), size_t p_iterations)
{
    size_t cycleSum = 0;

    for (size_t it = 0; it < p_iterations; ++it)
    {
        size_t startCycles = __rdtsc();
        p_function();
        cycleSum += __rdtsc() - startCycles;
    }

    return cycleSum / p_iterations;
}

int main()
{
    const size_t iterations = 100000;

    while (true)
    {
        std::cout << "Stack():        \t" << Benchmark(Stack, iterations)           <<  "\tcycles\n";
        std::cout << "Pool():         \t" << Benchmark(Pool, iterations)            <<  "\tcycles\n";
        std::cout << "FastPool():     \t" << Benchmark(FastPool, iterations)        <<  "\tcycles\n";
        std::cout << "FastHeapPool(): \t" << Benchmark(FastHeapPool, iterations)    <<  "\tcycles\n";

        std::cin.get();

        system("CLS");
    }

    return 0;
}

The 4 tests are Stack(), Pool(), FastPool(), and FastHeapPool().

Here are the results with MSVC v142 using C++17:

Debug: (benchmark screenshot)

Release: (benchmark screenshot)

Well... Not what I expected!

So now, my question is:

Is there any approach that can beat the classic one (6 allocations on the stack)? And why is allocating the size of an int 6 times just as fast as allocating the size of 6 ints once?

I'm only asking about memory allocation, not about optimizing the operations themselves.

Upvotes: 1

Views: 288

Answers (2)

jan.sende

Reputation: 860

I will ignore your code as I am not able to say anything about which version should be faster...

Anyways, you seem to have a misconception about how compilers work. None of the modern compilers translates the programme line by line. They all generate a so-called abstract syntax tree (AST) - a representation of what your programme does. This syntax tree is then heavily modified to get you the best performance optimizations possible. (Loops are unrolled, values pre-calculated,...) At last, the backend of the compiler produces an executable from the syntax tree, which is optimized for your system. (Machine specific instructions might be used, if available.)

Due to all these stages, it is pretty hard to guess what machine code is produced from your C++. In lots of cases, the compiler will even generate the same machine code from completely different programming approaches. Therefore, in your example, it is impossible to say which code runs faster without looking at the binary.

It is very likely that your fast version runs slower because of the way you wrote it. Compilers like simple code. Yet, your version is written in a complicated way, making it hard for the compiler to optimize it.

If you are interested in compilers and optimizations, you should check out:

Upvotes: 2

robthebloke

Reputation: 9668

Your test is horribly flawed. The methods Stack(), Pool(), and FastPool() will boil down to NOPs (they don't DO anything!!). new/delete, however, have possible side effects, so that accounts for the release performance difference. Now it might be time to learn what stack allocation actually does! If a stack-allocated variable is used within a method, it will most likely live in a register (unless it's a non-POD type with side effects), and whatever crazy concept you attempt to create in order to mimic that with memory will simply be orders of magnitude slower due to latencies, cache misses, etc.

In the olden days, we used to have the register keyword to differentiate between a stack-allocated var and a register. Not any more, because it was basically pointless. These days stack allocations only occur when you run out of registers and a register value needs to be spilled to stack space.

Upvotes: 4
