Reputation: 1
How could I go about replicating an x64 MOVQ (move quadword) instruction in x86 assembly?
For example, given:
movq xmm5, [esi+2h]
movq [edi+f1h], xmm5
Would this work?
push eax
push edx
mov eax, [esi+2h]
mov edx, [esi+6h] ; +4 byte offset
mov [edi+f1h], eax
mov [edi+f5h], edx ; +4 byte offset
pop edx
pop eax
Upvotes: 0
Views: 902
Reputation: 365247
SSE2 movq xmm, xmm/m64 works in 32-bit code (on CPUs that support it). The code you showed already uses 32-bit addressing modes, so it will work unchanged in 32-bit mode. There's another form of movq that only works in 64-bit mode: movq xmm, r64/m64, the opcode (including its memory-source form) that lets you do movq xmm0, rax.
But anyway, 32-bit SSE2:
movq xmm5, [esi+2h]
movq [edi+f1h], xmm5
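If you are writing C rather than assembly, the same 8-byte load/store can be sketched with the SSE2 intrinsics _mm_loadl_epi64 and _mm_storel_epi64, which compilers typically turn into the movq pair above. This is only a sketch: copy8 and HAVE_SSE2 are hypothetical names, and the memcpy fallback is an assumption for builds where SSE2 is not enabled.

```c
#include <stdint.h>
#include <string.h>

/* Use the intrinsics only when the compiler says SSE2 is enabled. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1   /* hypothetical macro name, local to this sketch */
#endif

/* copy8 is a hypothetical helper: copies exactly 8 bytes from src to dst.
   Neither pointer needs to be aligned. */
static void copy8(void *dst, const void *src)
{
#ifdef HAVE_SSE2
    __m128i v = _mm_loadl_epi64((const __m128i *)src); /* typically movq xmm, m64 */
    _mm_storel_epi64((__m128i *)dst, v);               /* typically movq m64, xmm */
#else
    memcpy(dst, src, 8); /* assumption: plain fallback without SSE2 */
#endif
}
```

Calling copy8(edi_ptr + 0xf1, esi_ptr + 2) would correspond to the two instructions above, with edi_ptr and esi_ptr standing in for whatever byte pointers you have.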
If you can only assume SSE1 but not SSE2, you can use movlps:
;; xorps xmm5,xmm5 ; optional to break a dependency on old value
movlps xmm5, [esi+2h] ; merges into xmm5: false dependency
movlps [edi+f1h], xmm5
Depending on what you're doing, it could possibly be worth it to use MMX if you have it but not SSE1:
movq mm0, [esi+2h]
movq [edi+f1h], mm0
; emms required later, after a loop.
If you really want a single-instruction 64-bit load/store so it's atomic (on P5 and later) for aligned addresses, then fild/fistp is a good choice. (gcc uses this for std::atomic<int64_t> with -m32 -mno-sse.)
It will never munge your data unless you (or MSVC++'s CRT) have the x87 precision bits set to less than a 64-bit mantissa.
fild qword ptr [esi+2h]
fistp qword ptr [edi+f1h]
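From C, the usual way to ask for an atomic 64-bit load and store is to let the compiler choose the instructions, as gcc does in the std::atomic case mentioned above. A minimal C11 sketch follows; atomic_copy64 is a hypothetical helper name, and which instructions you actually get depends on the target and compiler flags. Note that the load and the store are each atomic, but the copy as a whole is not one atomic operation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* atomic_copy64 is a hypothetical helper: one atomic 64-bit load from *src,
   then one atomic 64-bit store to *dst. Relaxed ordering is used because
   only atomicity of each access is wanted here, not ordering. */
static void atomic_copy64(_Atomic int64_t *dst, _Atomic int64_t *src)
{
    int64_t v = atomic_load_explicit(src, memory_order_relaxed);
    atomic_store_explicit(dst, v, memory_order_relaxed);
}
```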
fild/fistp might even have better throughput for copying scattered 64-bit chunks than using 32-bit integer load/store, at least on modern CPUs. For contiguous copies of maybe 32 or 64 bytes or larger, use rep movsd. (Usually the threshold for rep movsd being worth it is much higher, but here we're talking about copying without SIMD vectors, with only 32-bit integer or 64-bit fild/fistp multi-uop load/store instructions.)
With plain integer, just pick a register you can clobber. (Or in MSVC inline asm, let the compiler worry about saving it.) If registers are tight, only use one (if your src and dst are known not to overlap):
mov eax, [esi+2h]
mov [edi+f1h], eax
mov eax, [esi+2h + 4] ; write the +4 separately in the addressing mode as documentation
mov [edi+f1h + 4], eax
If you can spare 2 registers, then yes it's probably better to do both loads and then both stores.
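The two-register version (both loads, then both stores) can be sketched in C as follows, with each memcpy of 4 bytes standing in for one 32-bit mov. copy8_via_u32 is a hypothetical helper name. Because both halves are read before anything is written, this ordering still copies the original 8 bytes when src and dst overlap, which the one-register version above does not guarantee.

```c
#include <stdint.h>
#include <string.h>

/* copy8_via_u32 is a hypothetical helper: copies 8 bytes as two 32-bit
   halves, doing both loads before both stores. */
static void copy8_via_u32(unsigned char *dst, const unsigned char *src)
{
    uint32_t lo, hi;
    memcpy(&lo, src, 4);       /* like: mov eax, [src]     */
    memcpy(&hi, src + 4, 4);   /* like: mov edx, [src + 4] */
    memcpy(dst, &lo, 4);       /* like: mov [dst], eax     */
    memcpy(dst + 4, &hi, 4);   /* like: mov [dst + 4], edx */
}
```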
Upvotes: 2