user606521

Reputation: 15454

Using SSE to speed up computation - store, load and alignment

In my project I have implemented a basic class, CVector. This class contains a float* pointer to a raw floating-point array. The array is allocated dynamically using the standard malloc() function.

Now I have to speed up some computations on such vectors. Unfortunately, since the memory isn't allocated using _mm_malloc(), it is not guaranteed to be aligned.

As I understand I have two options:

1) Rewrite the code that allocates memory to use _mm_malloc(), and then use code like this:

void sub(float* v1, float* v2, float* v3, int size) 
{  
    __m128* p_v1 = (__m128*)v1;  
    __m128* p_v2 = (__m128*)v2;  
    __m128 res;

    for(int i = 0; i < size/4; ++i)  
    {  
        res = _mm_sub_ps(*p_v1,*p_v2);  
        _mm_store_ps(v3,res);  
        ++p_v1;  
        ++p_v2;  
        v3 += 4;  
    }
}

2) The second option is to use the _mm_loadu_ps() intrinsic to load an __m128 from unaligned memory and then use it for the computation.

void sub(float* v1, float* v2, float* v3, int size)
{  
    __m128 p_v1;  
    __m128 p_v2;  
    __m128 res;

    for(int i = 0; i < size/4; ++i)  
    {  
        p_v1 = _mm_loadu_ps(v1);   
        p_v2 = _mm_loadu_ps(v2);  
        res = _mm_sub_ps(p_v1,p_v2);    
        _mm_storeu_ps(v3,res);  // v3 is not aligned either, so the store must also be the unaligned variant
        v1 += 4;  
        v2 += 4;  
        v3 += 4;  
    }
}

So my question is which option will be better or faster?

Upvotes: 9

Views: 7016

Answers (2)

cppanda

Reputation: 1315

Take a look at Bullet Physics. It has been used in a handful of movies and well-known games (GTA4 and others). You can either study its heavily optimized vector, matrix and other math classes, or just use them directly. It's published under the zlib license, so you can use it as you wish. Don't reinvent the wheel: Bullet, NVIDIA PhysX, Havok and other physics libraries are well tested and optimized by really smart people.

Upvotes: 1

Hans Passant

Reputation: 942020

Reading unaligned SSE values is extraordinarily expensive. Check the Intel manuals, volume 4, chapter 2.2.5.1. The core type makes a difference; the i7 has extra hardware to make it less costly. But reading a value that straddles a CPU cache-line boundary is still 4.5 times slower than reading an aligned value. It is ten times slower on previous architectures.

That's massive, so get the memory aligned to avoid that perf hit. I've never heard of _mm_malloc; use _aligned_malloc() from the Microsoft CRT to get properly aligned memory from the heap.
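
To make the aligned route concrete, here is a minimal sketch of what it could look like. It uses _mm_malloc() from the question rather than _aligned_malloc(), on the assumption that the compiler's xmmintrin.h exposes it (GCC and Clang do). The helper names alloc_floats16, is_aligned16 and sub_aligned are hypothetical and exist only for this illustration. The scalar loop at the end handles lengths that are not a multiple of four, which the loops in the question skip.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumed to also declare _mm_malloc/_mm_free */

/* Hypothetical helper: allocate n floats starting on a 16-byte boundary. */
static float *alloc_floats16(size_t n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* out[i] = a[i] - b[i] for i in [0, n).
   All three pointers must be 16-byte aligned, because the aligned
   load/store intrinsics fault on misaligned addresses. */
static void sub_aligned(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: four floats per iteration with aligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_sub_ps(va, vb));
    }

    /* Scalar tail: the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        out[i] = a[i] - b[i];
}
```

Memory obtained from _mm_malloc() has to be released with _mm_free(), not free(). If the buffers cannot be guaranteed to be aligned, the same loop with _mm_loadu_ps() and _mm_storeu_ps() stays correct, at whatever cost the target CPU imposes for unaligned access.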

Upvotes: 14
