Adrien Givry

Reputation: 965

Can we beat the compiler?

Today I was writing a Pool Allocator when I came up with a question:
Is it possible to beat the compiler?

By beating the compiler I mean writing code that performs memory allocation in fewer clock cycles than the simplest version (allocating variables on the stack, one by one).

So I came up with a very simple BytePool:

template <size_t PoolSize>
class BytePool
{
public:
    template <typename T>
    T& At(size_t p_index)
    {
        return (T&)m_data[p_index * sizeof(T)];
    }

private:
    std::byte m_data[PoolSize];
};

This simple code allows me to allocate a byte array on the stack once and access it as if it were an array of T.

In order to manipulate this array, I made a macro:

#define is(type, slot) bytePool.At<type>(slot)

This macro allows me to write #define a is (int, 0x0000), so that a acts like a pseudo-variable pointing to bytePool[sizeof(int) * 0x0000].

Using this macro, I made a simple piece of code that performs basic operations with some numbers (Some defined at compile time, and some at runtime, such as b and c):

BytePool<sizeof(int) * 6> bytePool;

#define is(type, slot) bytePool.At<type>(slot)

#define a is (int, 0x0000)
#define b is (int, 0x0001)
#define c is (int, 0x0002)
#define d is (int, 0x0003)
#define e is (int, 0x0004)
#define f is (int, 0x0005)

a = 0;
b = (int)time(nullptr);
c = (int)__rdtsc();
d = 2 * b;
e = c - 3;
f = 18 ^ 2;

a = ~(b * c) * d + e / f;

#undef a
#undef b
#undef c
#undef d
#undef e
#undef f

Fun! This code reads as if I had manually assigned memory slots to my variables.

The equivalent without using the BytePool would be:

int a;
int b;
int c;
int d;
int e;
int f;

a = 0;
b = (int)time(nullptr);
c = (int)__rdtsc();
d = 2 * b;
e = c - 3;
f = 18 ^ 2;

a = ~(b * c) * d + e / f;

The question that I asked myself at this point was:
Which one of these approaches is the better one?

Naturally I was sure that allocating memory once was faster. So I guessed my BytePool approach was faster.

Now, let's listen to the compiler. I wrote some code to benchmark:

#include <iostream>
#include <intrin.h>
#include <ctime>
#include <cstddef> // std::byte

template <size_t PoolSize>
class BytePool
{
public:
    template <typename T>
    T& At(size_t p_index)
    {
        return (T&)m_data[p_index * sizeof(T)];
    }

private:
    std::byte m_data[PoolSize];
};

void Stack()
{
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;
}

void Pool()
{
    BytePool<sizeof(int) * 6> bytePool;

    #define is(type, slot) bytePool.At<type>(slot)

    #define a is (int, 0x0000)
    #define b is (int, 0x0001)
    #define c is (int, 0x0002)
    #define d is (int, 0x0003)
    #define e is (int, 0x0004)
    #define f is (int, 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
}

void FastPool()
{
    int fastBytePool[6];

    #define a   *(fastBytePool)
    #define b   *(fastBytePool + 0x0001)
    #define c   *(fastBytePool + 0x0002)
    #define d   *(fastBytePool + 0x0003)
    #define e   *(fastBytePool + 0x0004)
    #define f   *(fastBytePool + 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f
}

void FastHeapPool()
{
    int* fastBytePool = new int[6];

    #define a   *(fastBytePool)
    #define b   *(fastBytePool + 0x0001)
    #define c   *(fastBytePool + 0x0002)
    #define d   *(fastBytePool + 0x0003)
    #define e   *(fastBytePool + 0x0004)
    #define f   *(fastBytePool + 0x0005)

    a = 0;
    b = (int)time(nullptr);
    c = (int)__rdtsc();
    d = 2 * b;
    e = c - 3;
    f = 18 ^ 2;

    a = ~(b * c) * d + e / f;

    #undef a
    #undef b
    #undef c
    #undef d
    #undef e
    #undef f

    delete[] fastBytePool;
}

size_t Benchmark(void (*p_function)(), size_t p_iterations)
{
    size_t cycleSum = 0;

    for (size_t it = 0; it < p_iterations; ++it)
    {
        size_t startCycles = __rdtsc();
        p_function();
        cycleSum += __rdtsc() - startCycles;
    }

    return cycleSum / p_iterations;
}

int main()
{
    const size_t iterations = 100000;

    while (true)
    {
        std::cout << "Stack():        \t" << Benchmark(Stack, iterations)           <<  "\tcycles\n";
        std::cout << "Pool():         \t" << Benchmark(Pool, iterations)            <<  "\tcycles\n";
        std::cout << "FastPool():     \t" << Benchmark(FastPool, iterations)        <<  "\tcycles\n";
        std::cout << "FastHeapPool(): \t" << Benchmark(FastHeapPool, iterations)    <<  "\tcycles\n";

        std::cin.get();

        system("CLS");
    }

    return 0;
}

The 4 tests are Stack(), Pool(), FastPool(), and FastHeapPool().

Here are the results with MSVC v142 using C++17:

Debug: (benchmark screenshot)

Release: (benchmark screenshot)

Well... Not what I expected!

So now, my question is:

Is there any approach that can beat the classic one (6 allocations on the stack)? And why is allocating the size of an int 6 times just as fast as allocating the size of 6 ints once?

I'm only asking about memory allocation, not about optimizing the operations themselves.

Upvotes: 1

Views: 288

Answers (2)

jan.sende

Reputation: 860

I will ignore your code as I am not able to say anything about which version should be faster...

Anyways, you seem to have a misconception about how compilers work. None of the modern compilers translates the programme line by line. They all generate a so-called abstract syntax tree (AST) - a representation of what your programme does. This syntax tree is then heavily modified to get you the best performance optimizations possible. (Loops are unrolled, values pre-calculated,...) At last, the backend of the compiler produces an executable from the syntax tree, which is optimized for your system. (Machine specific instructions might be used, if available.)

Due to all these stages, it is pretty hard to guess what machine code is produced from your C++. In lots of cases, the compiler will even generate the same machine code from completely different programming approaches. Therefore, in your example, it is impossible to say which code runs faster without looking at the binary.

It is very likely that your fast version runs slower because of the way you wrote it. Compilers like simple code. Yet, your version is written in a complicated way, making it hard for the compiler to optimize it.

If you are interested in compilers and optimizations, you should check out:

Upvotes: 2

robthebloke

Reputation: 9668

Your test is horribly flawed. The methods Stack(), Pool(), and FastPool() will boil down to NOPs (they don't DO anything!!). new/delete, however, have possible side effects, so that accounts for the release performance difference. Now it might be time to learn what stack allocation actually does! If a stack-allocated variable is used within a method, it will most likely live in a register (unless it's a non-POD type with side effects), and whatever crazy concept you attempt to create in order to mimic that with memory will simply be orders of magnitude slower due to latencies, cache misses, etc.

In the olden days, we used to have the register keyword to differentiate between a stack-allocated var and a register. Not any more, because it was basically pointless. These days stack allocations only occur when you run out of registers and a register value needs to be spilled to stack space.

Upvotes: 4
