Reputation: 205
(machine is x86 64 bit running SL6)
I was trying to see if I can optimize memset on my 64 bit machine. As per my understanding memset goes byte by byte and sets the value. I assumed that if I do in units of 64 bits, it would be faster. But somehow it takes more time. Can someone take a look at my code and suggest why ?
/* Code */
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
void memset8(unsigned char *dest, unsigned char val, uint32_t count)
{
while (count--)
*dest++ = val;
}
void memset32(uint32_t *dest, uint32_t val, uint32_t count)
{
while (count--)
*dest++ = val;
}
void
memset64(uint64_t *dest, uint64_t val, uint32_t count)
{
while (count--)
*dest++ = val;
}
#define CYCLES 1000000000
int main()
{
clock_t start, end;
double total;
uint64_t loop;
uint64_t val;
/* memset 32 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset32((uint32_t*)&val, 0, 2);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset32 %g\n", total);
/* memset 64 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset64(&val, 0, 1);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset64 %g\n", total);
/* memset 8 */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset8((unsigned char*)&val, 0, 8);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset8 %g\n", total);
/* memset */
start = clock();
for (loop = 0; loop < CYCLES; loop++) {
val = 0xDEADBEEFDEADBEEF;
memset(&val, 0, 8);
}
end = clock();
total = (double)(end-start)/CLOCKS_PER_SEC;
printf("Timetaken memset %g\n", total);
printf("-----------------------------------------\n");
}
/*Result*/
Timetaken memset32 12.46
Timetaken memset64 7.57
Timetaken memset8 37.12
Timetaken memset 6.03
-----------------------------------------
Looks like the standard memset is more optimized than my implementation. I tried looking into code and everywhere is see that implementation of memset is same as what I did for memset8. When I use memset8, the results are more like what I expect and very different from memset. Can someone suggest what am I doing wrong ?
Upvotes: 6
Views: 3659
Reputation: 33669
I think it's worth exploring how to write a faster memset particularly with GCC (which I assume you are using with Scientific Linux 6) in C/C++. Many people assume the standard implementation is optimized. This is not necessarily true. If you see table 2.1 of Agner Fog's Optimizing Software in C++ manuals he compares memcpy for for several different compilers and platforms to his own assembly optimized version of memcpy. Memcpy in GCC at the time really underperformed (but the Mac version was good). He claims the built in functions are even worse and recommends using -no-builtin
. GCC in my experience is very good at optimizing code but its library functions (and built in functions) are not very optimized (with ICC it's the other way around).
It would be interesting to see how good you could do using intrinsics. If you look at his asmlib you can see how he implements memset
with SSE and AVX (it would be interesting to compare this to the Apple's optimized version Stephen Canon posted).
With AVX you can see he writes 32 bytes at a time.
K100: ; Loop through 32-bytes blocks. Register use is swapped
; Rcount = end of 32-bytes blocks part
; Rdest = negative index from the end, counting up to zero
vmovaps [Rcount+Rdest], ymm0
add Rdest, 20H
jnz K100
vmovaps in this case is the same as the intrinsic _mm256_store_ps
. Maybe GCC has improved since then but you might be able to beat GCC's implementation of memset
using intrinsics. If you don't have AVX you certainly have SSE (all x86 64bit do) so you could look at the SSE version of his code to see what you could do.
Here is a start for your memset32 funcion assuming the array fits in the L1 cache. If the array does not fit in the cache you want to do a non temporal store with _mm256_stream_ps
. For a general function you need several cases also including cases when the memory is not 32 byte aligned.
#include <immintrin.h>
int main() {
int count = (1<<14)/sizeof(int);
int* dest = (int*)_mm_malloc(sizeof(int)*count, 32); // 32 byte aligned
int val = 0xDEADBEEFDEADBEEF;
__m256 val8 = _mm256_castsi256_ps(_mm256_set1_epi32(val));
for(int i=0; i<count; i+=8) {
_mm256_store_ps((float*)(dest+i), val8);
}
}
Upvotes: 0
Reputation: 16305
As per my understanding memset goes byte by byte and sets the value.
The details of what the memset
facility does are implementation dependent. Relying on this is usually a good thing, because the I'm sure the implementors have extensive knowledge of the system and know all kind of techniques to make things as fast as possible.
To elaborate a little more, lets look at:
memset(&val, 0, 8);
When the compiler sees this it can notice a few things like:
and then choose the right instructions to use depending on where val
or &val
is (in a register, in memory, ...). But if memset
is stuck needing to be a function call (like your implementations), none of those optimizations are possible. Even if it can't make compile time decisions like:
memset(&val, x, y); // no way to tell at compile time what x and y will be...
you can be assured that there's a function call written in assembler that will be as fast as possible for your platform.
Upvotes: 3
Reputation: 106197
Actual memset
implementations are typically hand-optimized in assembly, and use the widest aligned writes available on the targeted hardware. On x86_64 that will be at least 16B stores (movaps
, for example). It may also take advantage of prefetching (this is less common recently, as most architectures have good automatic streaming prefetchers for regular access patterns), streaming stores or dedicated instructions (historically rep stos
was unusably slow on x86, but it is quite fast on recent microarchitectures). Your implementation does none of these things. It should not be terribly surprising that the system implementation is faster.
As an example, consider the implementation used in OS X 10.8 (which has been superseded in 10.9). Here’s the core loop for modest-sized buffers:
.align 4,0x90
1: movdqa %xmm0, (%rdi,%rcx)
movdqa %xmm0, 16(%rdi,%rcx)
movdqa %xmm0, 32(%rdi,%rcx)
movdqa %xmm0, 48(%rdi,%rcx)
addq $64, %rcx
jne 1b
This loop will saturate the LSU when hitting cache on pre-Haswell microarchitectures at 16B/cycle. An implementation based on 64-bit stores like your memset64
cannot exceed 8B/cycle (and may not even achieve that, depending on the microarchitecture in question and whether or not the compiler unrolls your loop). On Haswell, an implementation that uses AVX stores or rep stos
can go even faster and achieve 32B/cycle.
Upvotes: 10