Lee Jacobs

Memory not aligned properly?

I'm trying to use aligned operations in SSE and I'm having an issue (surprise).

struct __declspec(align(16)) Vec4 {
    float x;  
    float y;  
    float z;  
    float w;  
};

Vec4 SSE_Add(const Vec4 &a, const Vec4 &b) {  
    _declspec(align(16)) Vec4 return_val;  

    _asm { 
        MOV EAX, a                    // Load pointers into CPU regs
        MOV EBX, b

        MOVAPS XMM0, [EAX]            // Move aligned vectors to SSE regs (needs 16-byte alignment)
        MOVAPS XMM1, [EBX]

        ADDPS XMM0, XMM1              // Add vector elements
        MOVAPS [return_val], XMM0     // Save the return vector
    }

    return return_val;
}

I get an access violation at return return_val. Is this an alignment issue? How can I correct this?
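A quick way to test the alignment hypothesis is to check the addresses at run time. The sketch below assumes C++11, where alignas(16) is roughly the standard equivalent of __declspec(align(16)); is_aligned16 and check_sse_operands are hypothetical helper names, not part of any library. If these checks pass for a and b, misalignment is not what causes the crash.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p meets the 16-byte alignment MOVAPS requires.
inline bool is_aligned16(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Standard C++11 spelling of a 16-byte-aligned 4-float vector.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Compile-time checks: the type is 16-byte aligned and exactly one SSE register wide.
static_assert(alignof(Vec4) == 16, "Vec4 must be 16-byte aligned");
static_assert(sizeof(Vec4) == 16, "Vec4 must be exactly 16 bytes");

// Hypothetical debug check to call before any aligned (MOVAPS) access.
inline void check_sse_operands(const void *a, const void *b) {
    assert(is_aligned16(a) && "first operand not 16-byte aligned");
    assert(is_aligned16(b) && "second operand not 16-byte aligned");
}
```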

Answers (2)

catscradle

I found that the problem is the EBX register: if you push/pop EBX around the asm block, it works. I'm not sure why, though, so if anyone can explain this, please do.

Edit: I looked at the disassembly, and at the beginning of the function the compiler stores the stack pointer in EBX:

mov ebx, esp

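So you had better make sure not to clobber it. (Presumably the align(16) local forces the compiler to realign the stack, and it keeps the original ESP in EBX to reach the arguments and to restore ESP on return, which would explain the crash at the return statement.)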
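MSVC's __asm blocks have no way to declare which registers they overwrite, so saving and restoring EBX by hand is the workaround there. For contrast, here is a sketch of the same add in GCC/Clang extended-asm syntax (a different dialect from the question's MSVC code, compiled only when __SSE__ is defined; other targets fall back to scalar code). The "r" constraints let the compiler pick the pointer registers, and the clobber list names everything the asm modifies, so the asm cannot overwrite a register the compiler relies on. add_vec4_asm is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Hypothetical function: *r = *a + *b using aligned SSE moves.
inline void add_vec4_asm(const Vec4 *a, const Vec4 *b, Vec4 *r) {
    // MOVAPS faults on addresses that are not 16-byte aligned.
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(r) % 16 == 0);
#if (defined(__GNUC__) || defined(__clang__)) && defined(__SSE__)
    __asm__ volatile(
        "movaps (%1), %%xmm0\n\t"    // aligned load of *a
        "movaps (%2), %%xmm1\n\t"    // aligned load of *b
        "addps  %%xmm1, %%xmm0\n\t"  // xmm0 += xmm1 (AT&T operand order)
        "movaps %%xmm0, (%0)"        // aligned store to *r
        :                            // no register outputs
        : "r"(r), "r"(a), "r"(b)     // compiler chooses the pointer registers
        : "xmm0", "xmm1", "memory"); // declare everything the asm modifies
#else
    // Scalar fallback for compilers/targets without GCC-style asm or SSE.
    r->x = a->x + b->x; r->y = a->y + b->y;
    r->z = a->z + b->z; r->w = a->w + b->w;
#endif
}
```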

migle

This is a bit compiler-dependent... Isn't the correct thing to write movaps return_val, xmm0?

Why don't you show us the generated code?

Writing it this way is a lot worse than letting the compiler do it on its own:

  • This function should be inlinable and, in the best case, compile to a single instruction; written like this, it cannot be inlined.
  • On Intel 64, this function could receive its arguments in registers and return its result in a register; written like this, you force it to go through the stack.
  • This function could use return value optimization; written like this, xmm0 must be stored to the return_val variable, which then has to be copied a second time.
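Along the lines of these points, here is a minimal sketch that still uses SSE explicitly but through the intrinsics in <xmmintrin.h> (shipped by MSVC, GCC and Clang for x86) instead of inline assembly. The function can be inlined, the compiler allocates the XMM registers itself, and the result is returned by value. HAVE_SSE is a hypothetical macro; targets without SSE fall back to scalar code.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // _mm_load_ps, _mm_add_ps, _mm_store_ps
#define HAVE_SSE 1      // hypothetical macro: SSE intrinsics are available
#endif

// alignas(16) is the standard C++11 counterpart of __declspec(align(16)).
struct alignas(16) Vec4 {
    float x, y, z, w;
};

inline Vec4 SSE_Add(const Vec4 &a, const Vec4 &b) {
    // Aligned loads/stores (MOVAPS) require 16-byte-aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(&a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&b) % 16 == 0);
    Vec4 result;
#ifdef HAVE_SSE
    __m128 va = _mm_load_ps(&a.x);                // aligned load of a
    __m128 vb = _mm_load_ps(&b.x);                // aligned load of b
    _mm_store_ps(&result.x, _mm_add_ps(va, vb));  // ADDPS, then aligned store
#else
    // Scalar fallback for targets without SSE.
    result.x = a.x + b.x; result.y = a.y + b.y;
    result.z = a.z + b.z; result.w = a.w + b.w;
#endif
    return result;  // returned by value; eligible for return value optimization
}
```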

So... aligned versus unaligned moves (MOVAPS vs. MOVUPS) are the least of your concerns.

Why not just, in portable code, write:

inline void add(const float *__restrict__ a, const float *__restrict__ b, float *__restrict__ r)
{
    for (int i = 0; i != 4; ++i) r[i] = a[i] + b[i];
}
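One portability note: __restrict__ is a GCC/Clang spelling that MSVC does not accept, while __restrict is understood by MSVC, GCC and Clang. A sketch of the same function behind a hypothetical RESTRICT macro, with a cheap debug check of the no-aliasing promise:

```cpp
#include <cassert>

// Hypothetical macro: __restrict is accepted by MSVC, GCC and Clang.
#if defined(_MSC_VER) || defined(__GNUC__)
#define RESTRICT __restrict
#else
#define RESTRICT  // unknown compiler: drop the aliasing hint
#endif

inline void add(const float *RESTRICT a, const float *RESTRICT b, float *RESTRICT r)
{
    // restrict promises the arrays don't overlap. This catches the most common
    // violation (output passed as an input) but not partial overlap.
    assert(r != a && r != b);
    for (int i = 0; i != 4; ++i) r[i] = a[i] + b[i];
}
```

An optimizing compiler can typically vectorize this loop into a single ADDPS, though whether it does depends on the compiler and optimization flags.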
