excray
excray

Reputation: 2858

Code runs faster using placement new

I got this class,

Approach 1:

typedef float v4sf __attribute__ (vector_size(16))

class Unit
{
    public:
    Unit(int num)
    {
        u = new float[num];
        v = new float[num];
    }
    void update()
    {
        for(int i =0 ; i < num; i+=4)
        {
            *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
            //many other equations
        }
    }
    float*u,*v,*t; //and many other variables
}

Approach 2:

Same as approach 1. Except that in approach 2, v,u, and all other variables are allocated on a big chunk pre-allocated on heap, using placement new.

typedef float v4sf __attribute__ (vector_size(16))

class Unit
{
    public:
    Unit(int num)
    {
        buffer = new char[num*sizeof(*u) + sizeof(*v)  /*..and so on for other variables..*/]
        u = new(buffer) float[num];
        v = new(buffer+sizeof(float)*num) float[num];
        //And so on for other variables
    }
    void update()
    {
        for(int i =0 ; i < num; i+=4)
        {
            *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
            //many other equations
        }
    }
    char* buffer;
    float*u,*v,*t; //and many other variables
}

However, approach 2 is 2x faster. Why is that?

There are around 12 float variables and num is 500K. update() is called 1k times. The speed doesnt factor in the memory allocation. I measure the speed like this:

double start = getTime();
for( int i = 0; i < 1000; i++)
{
   unit->update();
}
double end = getTime();
cout<<end - start;

And this is around 2x faster in approach 2.

Compiler options: gcc -msse4 -o3 -ftree-vectorize.

L1 cache is 256K, Ram is 8GB, pagesize is 4K.

Edit: Corrected the mistake in allocating the variables in approach 2. All variables are allocated in different sections, correctly. Processor is Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Edit: added the source here - Source. Approach 1) gives 69.58s , Approach 2) gives 46.74s. Though not 2x faster, it is still fast.

Upvotes: 2

Views: 832

Answers (4)

Mark Ransom
Mark Ransom

Reputation: 308206

It would be useful to break down the timing and see what part is in the constructor versus what part is in update.

Since update didn't change, the only thing that would affect the timing of it is cache effects on the data. This is more than capable of accounting for a 2X difference.

Upvotes: 1

tumdum
tumdum

Reputation: 2031

Probably because 'Approach 2' has a bug -- all variables u, v, t are places in exactly the same place in memory (you are passing same address to placement new).

Edit: and now you don't... ;)

It's hard to guess without profiling, but it might be related to default allocator. If in first approach you have separate calls to new for each variable there is no guarantee that those variables will be assigned addresses which are close to each other. On the other hand, in second approach you making sure they are as close as possible to each other. This will maximize cache utilization and limit cache misses.

Upvotes: 4

timos
timos

Reputation: 2707

I assume in approach 2 the compiler was able to recognize that the addresses of u an v will not change between calls, and therefore keep some of the pointers used in the equations in the for loop in registers.

Upvotes: 0

Daniel
Daniel

Reputation: 31579

Normal new is actually allocation + construction while placement new is just construction.
So naturally, allocation + 2 construction is faster than allocation + construction + allocation + construction.
Moreover, construction of integral type is nop so in your case it's 2 allocations vs 1 allocation.

Upvotes: 0

Related Questions