ELHASKSERVERS

Reputation: 217

Assembly jump table: add the base before the jump, or use an addressing mode in the jump? (performance)

In assembly, suppose I have a jump table holding the addresses of over 2000 labels:

.TABLE:
     DD .case0
     DD .case1
     DD .case2
     DD .case3
     DD .case4
     ...
     ...
     ...
     DD .case2000

Which of the following ways of computing the jump target is better?

way 1:

mov    r12d, .TABLE    ; r12d or any other registers
mov    ebx, [r13d]     ; r13d holds the id of case * 4 so we don't need to '4 * ebx'
add    ebx, r12d       ; ebx = address for Jumping
jmp    ebx

way 2: (same as way 1, but 'add ebx, r12d' is removed and the jump becomes 'jmp [ebx+r12d]')

mov    r12d, .TABLE    ; r12d or any other registers
mov    ebx, [r13d]     ; r13d holds the id of case * 4 so we don't need to '4 * ebx'
jmp    [ebx+r12d]

way 3:

mov    ebx, [r13d]     ; r13d holds the id of case * 4 so we don't need to '4 * ebx'
jmp    [ebx + .TABLE]

With way 1, the code is larger because of the extra instruction, but I think it performs better than the other ways, because I am going to have about 2000 jumps in irregular order (for example, from case0 to case1000).

So, for jump performance, which way is better in code that contains a lot of jumps?

Upvotes: 0

Views: 98

Answers (1)

Peter Cordes

Reputation: 364160

Using 32-bit address size is a good choice if you can get away with it to compress the jump table vs. using qword pointers for 64-bit mode.

Otherwise you'd want to load 16-bit or 32-bit offsets (movzx or mov) and add to some 64-bit base address from a RIP-relative LEA for 64-bit code. (Which also makes it position-independent).
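The offset-table approach described above could look roughly like the following. This is only a sketch in NASM syntax for x86-64; the labels (dispatch, table, case0, case1) are made up for illustration, and it assumes the case index is already in edi and known to be in range.

```nasm
; Sketch only: NASM syntax, x86-64. All labels here are hypothetical.
; Assumes the case index is already in edi (zero-extended into rdi)
; and is known to be in range.
default rel

section .text
dispatch:
    lea     rcx, [table]              ; RIP-relative base of the table
    movsxd  rax, dword [rcx + rdi*4]  ; sign-extend the 32-bit offset
    add     rax, rcx                  ; offset + base = absolute target
    jmp     rax                       ; register-indirect jump

case0:
    xor     eax, eax
    ret
case1:
    mov     eax, 1
    ret

; Table kept in the same section as the code so that each
; "label - table" difference is a plain assemble-time constant.
align 4
table:
    dd case0 - table
    dd case1 - table
```

Because every entry is a distance from the table rather than an absolute address, nothing in the table depends on where the code ends up being loaded. The offsets are sign-extended because a case label may sit below the table in memory.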

"fewest instructions is not always a solution!"

But in this case fewest instructions is also fewest uops. [disp32 + reg] addressing modes are efficient.

If you were going to consider using more instructions, it would be to load the pointer into a register for jmp reg instead of using jmp [mem], not simplifying addressing modes even more.

https://agner.org/optimize/ shows that jmp mem on Intel Sandybridge family is still only 1 fused-domain uop, with the load micro-fused into the port 6 jump uop. So a separate mov load would actually cost more uops in the front-end.

(An indexed addressing mode would probably unlaminate; jmp [.TABLE + ebx*4] would cost 2 uops for the issue/rename stage but still only 1 in the decoders and uop cache. But it seems you have a byte offset stored in memory for some reason, so you don't need a scaled index.)
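For reference, the forms being compared can be written side by side. This is only a sketch in NASM syntax, assuming 32-bit code so that each dd entry is a complete pointer; the labels are hypothetical and the sketch says nothing about timing by itself.

```nasm
; Sketch only: NASM syntax, 32-bit code, hypothetical labels.
bits 32

section .text
; (a) ebx holds a byte offset (index * 4): load folded into the jump.
dispatch_mem:
    jmp     [table + ebx]

; (b) same lookup with a separate load, then a register-indirect jump.
dispatch_reg:
    mov     eax, [table + ebx]
    jmp     eax

; (c) ebx holds the plain case index: scaled-index addressing mode.
dispatch_scaled:
    jmp     [table + ebx*4]

case0:
    ret
case1:
    ret

section .rodata
align 4
table:
    dd case0
    dd case1
```

Form (a) corresponds to way 3 in the question, (b) is the "more instructions" alternative with jmp reg, and (c) is the scaled-index form from the parenthetical note.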

Upvotes: 1
