Reliable information about x86 string instruction performance?

Question

Common wisdom is that rep movsb is much slower than rep movsd (or on 64-bit, rep movsq) when performing identical operations. However, I've been testing on a few modern machines, and the run times are coming out identical (up to measurement noise) across a huge range of buffer sizes (10 bytes to 2 MB). So far I have tested on only two machines (a 32-bit Intel Atom D510 and a 64-bit AMD FX 8120).

Are there any modern x86 (32- or 64-bit) machines where rep movsb is slower than rep movsd (or rep movsq)?
If not, what was the last machine where the difference was significant, and how significant was it?
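
For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a timing harness. It is not the code behind the numbers above. The names copy_movsb, copy_movsq and time_copy are made up for illustration, the inline asm assumes GCC or Clang targeting x86-64, and on any other target the copies fall back to memcpy, which makes the comparison meaningless there. clock() is coarse, so a serious test would use a finer timer and many more repetitions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Copy n bytes one byte at a time with "rep movsb". */
static void copy_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n); /* placeholder only: not a real comparison */
#endif
}

/* Copy n bytes eight at a time with "rep movsq"; n must be a multiple of 8. */
static void copy_movsq(void *dst, const void *src, size_t n)
{
    size_t count = n / 8;
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(count)
                     :
                     : "memory");
#else
    memcpy(dst, src, count * 8); /* placeholder only: not a real comparison */
#endif
}

/* Average CPU seconds per call of fn over reps calls on an n-byte buffer.
   Returns -1.0 if allocation fails or the copy produced wrong data. */
static double time_copy(copy_fn fn, size_t n, int reps)
{
    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, n);
    fn(dst, src, n); /* warm-up: touches both buffers before timing */

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_t t1 = clock();

    int ok = memcmp(dst, src, n) == 0;
    free(src);
    free(dst);
    return ok ? (double)(t1 - t0) / CLOCKS_PER_SEC / reps : -1.0;
}
```

Calling time_copy(copy_movsb, 1 << 20, 1000) and time_copy(copy_movsq, 1 << 20, 1000) and comparing the two results gives a rough per-call figure for a 1 MiB buffer.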

I'm asking this question from a standpoint of wanting to avoid cargo-culting a bunch of tests to break memory up into unaligned head/tail and aligned middle for the sake of using rep movsd or rep movsq if there's no actual benefit to doing this...
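
To make concrete what that head/middle/tail split involves, here is a rough sketch of the bookkeeping. The struct copy_split and the function split_for_qwords are hypothetical names, and the sketch assumes only the destination is aligned to 8 bytes (the source may stay unaligned), which is one common choice, not the only one.

```c
#include <stddef.h>
#include <stdint.h>

/* How an n-byte copy would be divided for a qword-based middle section. */
struct copy_split {
    size_t head;   /* leading bytes copied one at a time */
    size_t qwords; /* 8-byte units in the aligned middle */
    size_t tail;   /* trailing bytes copied one at a time */
};

/* Split a copy of n bytes whose destination starts at dst_addr so that
   the middle section begins on an 8-byte boundary of the destination. */
struct copy_split split_for_qwords(uintptr_t dst_addr, size_t n)
{
    struct copy_split s;

    /* Bytes needed to reach the next 8-byte boundary (0 if already aligned). */
    s.head = (size_t)(-dst_addr & (uintptr_t)7);
    if (s.head > n)
        s.head = n; /* short copy that never reaches the boundary */

    s.qwords = (n - s.head) / 8;
    s.tail = (n - s.head) % 8;
    return s;
}
```

The head and tail would then go through rep movsb and the middle through rep movsq. If rep movsb alone is just as fast, all of this collapses into a single instruction.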

user555045 · Accepted Answer

Lots of benchmarks here: instlatx64.atw.hu

For example (Intel Core 2 Duo E6700):

REP MOVSB   BW in L1D:13.04 B/c  34829MiB/s
REP MOVSW   BW in L1D:13.29 B/c  35493MiB/s
REP MOVSD   BW in L1D:13.40 B/c  35783MiB/s

This shows that there is a difference, but it's tiny (under 3% between REP MOVSB and REP MOVSD).

This one for SandyBridge is a little weird:

REP MOVSB   BW in L1D:25.50 B/c  86986MiB/s
REP MOVSW   BW in L1D:18.09 B/c  61721MiB/s
REP MOVSD   BW in L1D:27.47 B/c  93693MiB/s

There seems to be a big difference on some Atoms, roughly 7x between REP MOVSB and REP MOVSD (it appears to have disappeared with the D5xx, so you just missed it):

REP MOVSB   BW in L1D: 0.53 B/c    990MiB/s
REP MOVSW   BW in L1D: 1.93 B/c   3598MiB/s
REP MOVSD   BW in L1D: 3.74 B/c   6960MiB/s

I haven't found such a big difference on anything else that can be considered new.
