Context: A while ago, I stumbled upon this 2001 DDJ article by Alexandrescu: http://www.ddj.com/cpp/184403799 It's about comparing various ways to initialized a buffer to some value. Like what "memset" does for single-byte values. He compared various implementations (memcpy, explicit "for" loop, duff's device) and did not really find the best candidate across all dataset sizes and all compilers. Quote: There is a very deep, and sad, realization underlying all this. We are in 2001, the year of the Spatial Odyssey. (...) Just step out of the box and look at us — after 50 years, we're still not terribly good at filling and copying memory. Question: does anyone have more recent information about this problem ? Do recent GCC and Visual C++ implementations perform significantly better than 7 years ago ? I'm writing code that has a lifetime of 5+ (probably 10+) years and that will process arrays' sizes from a few bytes to hundred of megabytes. I can't assume that my choices now will still be optimal in 5 years. What should I do: a) use the system's memset (or equivalent) and forget about optimal performance or assume the runtime and compiler will handle this for me. b) benchmark once and for all on various array sizes and compilers and switch at runtime between several routines. c) run the benchmark at program initialization and switch at runtime based on accurate (?) data. Edit: I'm working on image processing software. My array items are PODs and every millisecond counts ! Edit 2: Thanks for the first answers, here are some additional informations: Buffer initialization may represent 20%-40% of total runtime of some algorithms. The platform may vary in the next 5+ years, although it will stay in the "fastest CPU money can buy from DELL" category. Compilers will be some form of GCC and Visual C++. No embedded stuff or exotic architectures on the radar I'd like to hear from people who had to update their software when MMX and SSE appeared, since I'll have to do the same when "SSE2015" becomes available... :)

Reputation: 7827

State of "memset" functionality in C++ with modern compilers

Context:

A while ago, I stumbled upon this 2001 DDJ article by Alexandrescu: http://www.ddj.com/cpp/184403799

It's about comparing various ways to initialized a buffer to some value. Like what "memset" does for single-byte values. He compared various implementations (memcpy, explicit "for" loop, duff's device) and did not really find the best candidate across all dataset sizes and all compilers.

Quote:

There is a very deep, and sad, realization underlying all this. We are in 2001, the year of the Spatial Odyssey. (...) Just step out of the box and look at us — after 50 years, we're still not terribly good at filling and copying memory.

Question:

does anyone have more recent information about this problem ? Do recent GCC and Visual C++ implementations perform significantly better than 7 years ago ?
I'm writing code that has a lifetime of 5+ (probably 10+) years and that will process arrays' sizes from a few bytes to hundred of megabytes. I can't assume that my choices now will still be optimal in 5 years. What should I do:
- a) use the system's memset (or equivalent) and forget about optimal performance or assume the runtime and compiler will handle this for me.
- b) benchmark once and for all on various array sizes and compilers and switch at runtime between several routines.
- c) run the benchmark at program initialization and switch at runtime based on accurate (?) data.

Edit: I'm working on image processing software. My array items are PODs and every millisecond counts !

Edit 2: Thanks for the first answers, here are some additional informations:

Buffer initialization may represent 20%-40% of total runtime of some algorithms.
The platform may vary in the next 5+ years, although it will stay in the "fastest CPU money can buy from DELL" category. Compilers will be some form of GCC and Visual C++. No embedded stuff or exotic architectures on the radar
I'd like to hear from people who had to update their software when MMX and SSE appeared, since I'll have to do the same when "SSE2015" becomes available... :)

Upvotes: 19

Answers (12)

Skizz

Reputation: 71130

As always with these types of questions, the problem is constrained by factors outside of your control, namely, the bandwidth of the memory. And if the host OS decides to start paging the memory then things get far worse. On Win32 platforms, the memory is paged and pages are only allocated on first use which will generate a big pause every page boundary whilst the OS finds a page to use (this may require another process' page to be paged to disk).

This, however, is the absolute fastest memset ever written:

void memset (void *memory, size_t size, byte value)
{
}

Not doing something is always the fastest way. Is there any way the algorithms can be written to avoid the initial memset? What are the algorithms you're using?

Upvotes: 1

MSalters

Reputation: 180305

The year isn't 2001 anymore. Since then, new versions of Visual Studio have appeared. I've taken the time to study the memset in those. They will use SSE for memset (if available, of course). If your old code was correct, statistically if will now be faster. But you might hit an unfortunate cornercase. I expect the same from GCC, although I haven't studied the code. It's a fairly obvious improvement, and an Open-Source compiler. Someone will have created the patch.

Upvotes: 0

Zooba

Reputation: 11468

The MASM Forum has a lot of incredible assembly language programmers/hobbyists who have beaten this issue completely to death (have a look through The Laboratory). The results were much like Christopher's response: SSE is incredible for large, aligned, buffers, but going down you will eventually reach such a small size that a basic for loop is just as quick.

Upvotes: 8

Kevin

Reputation: 25957

If you have to allocate your memory as well as initialize it, I would:

Use calloc instead of malloc
Change as much of my default values to be zero as possible (ex: let my default enumeration value be zero; or if a boolean variable's default value is 'true', store it's inverse value in the structure)

The reason for this is that calloc zero-initializes memory for you. While this will involve the overhead for zeroing memory, most compilers are likely to have this routine highly-optimized -- more optimized that malloc/new with a call to memcpy.

Upvotes: 1

Christopher

Reputation: 8992

Memset/memcpy are mostly written with a basic instruction set in mind, and so can be outperformed by specialized SSE routines, which on the other hand enforce certain alignment constraints.

But to reduce it to a list :

For data-sets <= several hundred kilobytes memcpy/memset perform faster than anything you could mock up.
For data-sets > megabytes use a combination of memcpy/memset to get the alignment and then use your own SSE optimized routines/fallback to optimized routines from Intel etc.
Enforce the alignment at the start up and use your own SSE-routines.

This list only comes into play for things where you need the performance. Too small/or once initialized data-sets are not worth the hassle.

Here is an implementation of memcpy from AMD, I can't find the article which described the concept behind the code.

Upvotes: 5

quinmars

Reputation: 11583

You can take a look on liboil, they (try to) provide different implementation of the same function and choosing the fastest on initialization. Liboil has a pretty liberal licence, so you can take it also for proprietary software.

http://liboil.freedesktop.org/

Upvotes: 2

Dan Cristoloveanu

Reputation: 2058

I would always choose an initialization method that is part of the runtime or OS (memset) I am using (worse case pick one that is part of a library that I am using).

Why: If you are implementing your own initialization, you might end up with a marginally better solution now, but it is likely that in a couple of years the runtime has improved. And you don't want to do the same work that the guys maintaining the runtime do.

All this stands if the improvement in runtime is marginal. If you have a difference of an order of magnitude between memset and your own initialization, then it makes sense to have your code running, but I really doubt this case.

Upvotes: 1

slashmais

Reputation: 7155

If memory is not a problem, then precreate a static buffer of the size you need, initialized to your value(s). As far as I know, both these compilers are optimizing compilers, so if you use a simple for-loop, the compiler should generate the optimum assembler-commands to copy the buffer across.

If memory is a problem, use a smaller buffer & copy that accross at sizeof(..) offsets into the new buffer.

HTH

Upvotes: 1

Dark Shikari

Reputation: 8029

It depends what you're doing. If you have a very specific case, you can often vastly outperform the system libc (and/or compiler inlining) of memset and memcpy.

For example, for the program I work on, I wrote a 16-byte-aligned memcpy and memset designed for small data sizes. The memcpy was made for multiple-of-16 sizes greater than or equal to 64 only (with data aligned to 16), and memset was made for multiple-of-128 sizes only. These restrictions allowed me to get enormous speed, and since I controlled the application, I could tailor the functions specifically to what was needed, and also tailor the application to align all necessary data.

The memcpy performed at about 8-9x the speed of the Windows native memcpy, knocing a 460-byte copy down to a mere 50 clock cycles. The memset was about 2.5x faster, filling a stack array of zeros extremely quickly.

If you're interested in these functions, they can be found here; drop down to around line 600 for the memcpy and memset. They're rather trivial. Note they're designed for small buffers that are supposed to be in cache; if you want to initialize enormous amounts of data in memory while bypassing cache, your issue may be more complex.

Upvotes: 4

amo-ej1

Reputation: 3307

Well this all depends on your problem domain and your specifications, have you ran into performance issues, failed to meet timing deadline and pinpointed memset as being the root of all evil ? If it this you're in the one and only case where you could consider some memset tuning.

Then you should also keep in mind that the memset anyhow will vary on the hardware the platform it is ran on, during those five years, will the software run on the same platform ? On the same architecture ? One you come to that conclusion you can try to 'roll your own' memset, typically playing with the alignment of buffers, making sure you zero 32 bit values at once depending on what is most performant on your architecture.

I once ran into the same for memcmpt where the alignment overhead caused some problems, bit typically this will not result in miracles, only a small improvement, if any. If you're missing your requirements by an order of mangnitude than this won't get you any further.

Upvotes: 1

Rob Walker

Reputation: 47502

The DDJ article acknowledges that memset is the best answer, and much faster than what he was trying to achieve:

There is something sacrosanct about C's memory manipulation functions memset, memcpy, and memcmp. They are likely to be highly optimized by the compiler vendor, to the extent that the compiler might detect calls to these functions and replace them with inline assembler instructions — this is the case with MSVC.

So, if memset works for you (ie. you are initializing with a single byte) then use it.

Whilst every millisecond may count, you should establish what percentage of your execution time is lost to setting memory. It is likely very low (1 or 2%??) given that you have useful work to do as well. Given that the optimization effort would likely have a much better rate of return elsewhere.

Upvotes: 10

Paul Tomblin

Reputation: 182878

d) Accept that trying to play "jedi mind tricks" with the initialization will lead to more lost programmer hours than the cumulative milliseconds difference between some obscure but fast method versus something obvious and clear.

Upvotes: 4

State of &quot;memset&quot; functionality in C++ with modern compilers

Context:

Question:

Answers (12)

Related Questions

State of "memset" functionality in C++ with modern compilers