Reputation: 215507
Common wisdom is that rep movsb is much slower than rep movsd (or on 64-bit, rep movsq) when performing identical operations. However, I've been testing on a few modern machines, and the run times are coming out identical (up to measurement noise) across a huge range of buffer sizes (10 bytes to 2 MB). So far I have tested on just 2 machines (32-bit Intel Atom D510 and 64-bit AMD FX 8120).
Are there any modern x86 (32- or 64-bit) machines where rep movsb is slower than rep movsd (or rep movsq)?
If not, what was the last machine where the difference was significant, and how significant was it?
I'm asking because I want to avoid cargo-culting a bunch of tests that break memory up into an unaligned head/tail and an aligned middle, just for the sake of using rep movsd or rep movsq, if there's no actual benefit to doing this.
Upvotes: 14
Views: 1714
Reputation: 64913
Lots of benchmarks here: instlatx64.atw.hu
For example (Intel Core 2 Duo E6700):
REP MOVSB BW in L1D:13.04 B/c 34829MiB/s
REP MOVSW BW in L1D:13.29 B/c 35493MiB/s
REP MOVSD BW in L1D:13.40 B/c 35783MiB/s
Which shows that there is a difference, but it's tiny.
This one for SandyBridge is a little weird, with REP MOVSW coming out slower than both REP MOVSB and REP MOVSD:
REP MOVSB BW in L1D:25.50 B/c 86986MiB/s
REP MOVSW BW in L1D:18.09 B/c 61721MiB/s
REP MOVSD BW in L1D:27.47 B/c 93693MiB/s
There seems to be a big difference on some Atoms (it appears to have disappeared with the D5xx, so you just missed it):
REP MOVSB BW in L1D: 0.53 B/c 990MiB/s
REP MOVSW BW in L1D: 1.93 B/c 3598MiB/s
REP MOVSD BW in L1D: 3.74 B/c 6960MiB/s
I haven't found such a big difference on anything else that can be considered new.
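To put "tiny" and "big" into numbers, the REP MOVSD to REP MOVSB ratio can be worked out from the B/c figures quoted above. This is only a rough sketch: the struct and function names are made up for illustration, and the values are copied straight from the three listings in this answer:

```c
#include <assert.h>
#include <stdio.h>

/* One row per machine: bytes per cycle in L1D, as quoted above. */
struct l1d_result {
    const char *cpu;
    double movsb_bpc;
    double movsd_bpc;
};

static const struct l1d_result results[] = {
    { "Core 2 Duo E6700",  13.04, 13.40 },
    { "SandyBridge",       25.50, 27.47 },
    { "Atom (pre-D5xx)",    0.53,  3.74 },
};

/* How many times faster the wide variant is than the byte variant. */
static double speedup(double wide_bpc, double byte_bpc)
{
    assert(byte_bpc > 0.0);
    return wide_bpc / byte_bpc;
}

/* Print the MOVSD/MOVSB ratio for each machine in the table. */
static void print_speedups(void)
{
    for (size_t i = 0; i < sizeof results / sizeof results[0]; i++) {
        printf("%-18s MOVSD/MOVSB = %.2fx\n", results[i].cpu,
               speedup(results[i].movsd_bpc, results[i].movsb_bpc));
    }
}
```

On these figures the ratio stays within a few percent of 1 on the Core 2 and within about 10% on SandyBridge, while on the older Atom it is roughly a factor of seven.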
Upvotes: 16