McLovin

Reputation: 3417

Having more allocated memory slows down operations?

I'm reading a book called Real-Time Rendering and as an optimization in a ray-sphere intersection algorithm the squared radius of the sphere is used, so the author says:

The scalar r2 (square of radius) could be computed once and stored within the data structure of the sphere in an attempt to gain further efficiency. In practice such an "optimization" may be slower, as more memory is then accessed, a major factor for algorithm performance.

How could this be not efficient? The r2 value will be accessed once in the local scope of the function; could this be slower than explicitly calculating r2 = r * r?

I'm not even sure whether the slow part is accessing the value or simply having that data stored in memory.
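
For reference, here is a minimal sketch of the kind of test being discussed (the vector helpers and the simplified boolean test are my own illustration, not the book's exact code); the only difference between the two variants is whether r2 is computed locally or read back out of the sphere's data structure:

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Boolean ray/sphere test: ray origin o, normalized direction d,
// sphere center c, radius r.
bool hit(const Vec3& o, const Vec3& d, const Vec3& c, float r)
{
    const float r2 = r * r;                 // variant A: computed locally, stays in a register
    // const float r2 = sphere.r2;          // variant B: read from a Sphere struct carrying a precomputed r2
    const Vec3  l  = { c.x - o.x, c.y - o.y, c.z - o.z };
    const float s  = dot(l, d);             // projection of l onto the ray
    const float l2 = dot(l, l);
    if (s < 0.0f && l2 > r2) return false;  // sphere lies behind the ray origin
    return l2 - s * s <= r2;                // squared center-to-ray distance vs r2
}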

Upvotes: 4

Views: 277

Answers (3)

g24l

Reputation: 3125

It is the well-known space/time trade-off. Don't think about just one sphere but many.

template<typename T> struct sphere
{
  T posx, posy;  // center position
  T r;           // radius
  T r2;          // cached square of the radius

  static sphere<T> create( T posx, T posy, T r );
};

This means that you store a few extra bytes for every sphere, and those bytes have to be fetched from memory in order to do your computation. When only a few ray tests happen, that is obviously not a problem. But in real scenarios you will have many, many spheres, and you gain by making them smaller.
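
As a rough illustration of the size difference (a sketch, assuming T = float and a 64-byte cache line; exact sizes depend on padding and platform, and the struct names are mine):

#include <cstdio>

template<typename T> struct sphere_plain
{
    T posx, posy;
    T r;
};

template<typename T> struct sphere_cached
{
    T posx, posy;
    T r;
    T r2;   // memoized square of the radius
};

int main()
{
    // With float: 12 bytes vs 16 bytes per sphere, so a 64-byte cache line
    // holds five of the plain spheres but only four of the cached ones.
    std::printf("plain:  %zu bytes\n", sizeof(sphere_plain<float>));
    std::printf("cached: %zu bytes\n", sizeof(sphere_cached<float>));
}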

Upvotes: 1

user4842163

Reputation:

I remember being at this point before. It wasn't too long ago and I'm still learning about this. I'll try to provide an informal answer to complement the already precise one.

The quickest way to get started, though, is to grab yourself a decent profiler which can break down the execution times of every little part of your codebase involved in a profiling session. Next, as you work your way to more difficult hotspots which don't have a glaring algorithmic bottleneck, start drilling down to measuring cache misses and branch mispredictions.

After that, gaining the knowledge of why these hotspots exist in spite of a fairly optimal algorithm and what to do about them starts to become fairly automatic -- you'll discover the information on your own chasing hotspots.

Memory Hierarchy

Put simply, the computer's architecture has a memory hierarchy ranging from small, fast, expensive memory to large, slow, inexpensive memory. It spans all the way from a massive but very slow hard disk, whose contents get paged into DRAM in, say, 4-kilobyte chunks, down through a CPU cache hierarchy made of 64-byte cache lines, and all the way into a little teeny register which might be just a measly 8 bytes (64 bits). And that little teeny register is where most of the operations are performed.

Once the operations are performed and the results are computed, typically the results must work their way back up this memory hierarchy so that they can be saved persistently while making room for new operations to be performed on new regions of memory.

To do this quickly, hardware is often designed to grab memory from those bigger but slower forms of memory in fairly large chunks. For example, secondary storage devices like a hard disk are much slower than DRAM, so the computer pages memory in, say, 4-kilobyte chunks. The assumption there is that, even if you immediately requested just an 8-byte (64-bit) chunk, you'll likely access data from the surrounding region of memory. So memory is paged into DRAM in aligned 4-kilobyte chunks, with the assumption (or "hope") that you, the programmer, will access a great deal of that chunk in the following set of instructions.

The same kind of process applies from DRAM to the CPU cache. The CPU cache is considerably faster than DRAM, so the computer tends to fetch memory from DRAM into the CPU cache in fairly large chunks (typically 64-byte-aligned cache lines). The assumption/prediction/hope of the hardware designers is, once again, that you'll make use of that chunk before it gets evicted.

And this kind of process repeats again from CPU cache to register (which is even faster than CPU cache), moving from slower memory to faster memory in aligned handfuls (though this time in pretty small handfuls).
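
One way to observe that chunking from user code (a rough sketch, not a rigorous benchmark): a loop that touches one byte per 64-byte cache line is usually nowhere near 16x faster than one that touches sixteen bytes per line, because both pull in the same set of cache lines.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<char> data(64 * 1024 * 1024);   // 64 MiB, far larger than the caches

    auto touch = [&](std::size_t stride) {
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < data.size(); i += stride)
            data[i] += 1;                        // one write every `stride` bytes
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    // Both loops fetch the same 64-byte cache lines, so the second one typically
    // runs far less than 16x faster despite doing 1/16th of the writes.
    std::printf("stride  4: %.1f ms\n", touch(4));
    std::printf("stride 64: %.1f ms\n", touch(64));
    std::printf("%d\n", data[0]);                // keep the writes observable
}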

Speed

So much of the speed of software is ultimately bottlenecked by memory access. It takes an exceptionally optimal memory layout, or really heavy computation on a small chunk of memory, to actually be bottlenecked by arithmetic more than by this process of transferring chunks of data from slower to faster forms of memory. You might see this in real-world benchmarks where using AVX instructions improves the speed of an operation that mostly consists of floating-point arithmetic by 45%. Why did we get only 45% when the AVX instructions should be 8 times faster than the scalar versions? Putting aside potential compiler optimizations which might insert AVX instructions anyway, a lot of that has to do with memory. Computers can do arithmetic really fast, but they can only do it as quickly as they can load memory into those faster but teeny forms of memory (registers, e.g.).
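
To make that concrete, here is a sketch (my own illustration, not the benchmark referred to above) of a streaming multiply: the 8-wide AVX loop has roughly 8x the arithmetic throughput, but for large arrays both versions spend most of their time waiting on memory, so the measured speedup is usually far smaller.

#include <immintrin.h>
#include <cstddef>

// c[i] = a[i] * b[i], one element at a time.
void mul_scalar(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// Same computation, eight floats per iteration with AVX.
// (Assumes n is a multiple of 8 to keep the sketch short; compile with -mavx.)
void mul_avx(const float* a, const float* b, float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_mul_ps(va, vb));
    }
}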

From a software engineering standpoint, it helps a lot if you can make your data smaller, contiguous, aligned properly to the size of a register, cache line, and/or page, and accessed contiguously in chunks. A lot of the assumptions and optimizations here by the hardware designers favor an array-like sequential access pattern where you're accessing that contiguous region of memory while cached in a faster form prior to having it evicted from that faster form of memory to make room for other operations on other regions of memory.
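
For example (a sketch of the general idea, not something from this question): a structure-of-arrays layout keeps each field contiguous, so a pass that only needs the radii streams through tightly packed cache lines instead of skipping over center coordinates it never uses.

#include <vector>

// Array-of-structures: each sphere's fields sit together, so a pass that only
// reads r still drags the centers through the cache.
struct SphereAoS { float x, y, z, r; };

// Structure-of-arrays: each field is its own contiguous array, which matches
// the sequential access pattern the hardware prefetcher is built for.
struct SpheresSoA
{
    std::vector<float> x, y, z;
    std::vector<float> r;
};

float sum_radii(const SpheresSoA& s)
{
    float sum = 0.0f;
    for (float r : s.r)      // touches only the bytes this loop actually needs
        sum += r;
    return sum;
}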

How could this be not efficient? The r2 value will be accessed once in the local scope of the function; could this be slower than explicitly calculating r2 = r * r?

If it's stored in the local scope of a function, chances are that it will be stored and kept in a register for the whole duration. What the author was specifically referring to was keeping the data for r^2 in a sphere as a form of memoization.

In that case, each sphere being processed increases in size, and so fewer spheres potentially end up fitting into these faster chunks of memory. It may also throw off the alignment of a sphere and lead to scenarios where a single sphere straddles two aligned cache lines, potentially taking a performance hit that way.
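
To illustrate the straddling point (a sketch; 64 bytes is a typical cache-line size, not a guarantee, and the struct names are mine): a 3D sphere of four floats tiles 64-byte lines exactly, while adding a cached r2 makes it 20 bytes, so some elements of a packed array inevitably cross a line boundary.

#include <cstdio>

struct SphereNoCache { float x, y, z, r; };        // 16 bytes: four per cache line, none straddling
struct SphereCached  { float x, y, z, r, r2; };    // 20 bytes: no longer divides 64 evenly

int main()
{
    // In a 64-byte-aligned array, SphereCached element 3 occupies bytes 60..79,
    // which spans two cache lines; reading it can cost two line fetches.
    std::printf("%zu %zu\n", sizeof(SphereNoCache), sizeof(SphereCached));
    std::printf("element 3 starts at byte offset %zu\n", 3 * sizeof(SphereCached));
}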

You can see a similar trend with lookup tables. LUTs can often diminish performance and lead to disappointing results, especially if they're rather large. We end up trading reduced arithmetic for increased memory access, and memory access tends to become the bottleneck much more quickly.
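
A concrete version of that trade (a sketch of the general pattern, not code from this answer): a lookup table for something as cheap as squaring replaces one multiply, which stays in registers, with a load that can miss the cache.

#include <cstddef>
#include <vector>

// 65536 precomputed squares = 256 KiB of table, which is unlikely to stay
// resident in L1/L2 alongside everything else a renderer touches.
std::vector<float> make_square_lut()
{
    std::vector<float> lut(65536);
    for (std::size_t i = 0; i < lut.size(); ++i)
        lut[i] = static_cast<float>(i) * static_cast<float>(i);
    return lut;
}

float square_via_lut(const std::vector<float>& lut, unsigned n)
{
    return lut[n];      // a memory access, possibly a cache miss costing hundreds of cycles
}

float square_direct(float x)
{
    return x * x;       // roughly one cycle, operands stay in registers
}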

So it's very counter-intuitive, and this is far from an expert answer, but the way the machine works and what it considers to be "expensive work" is very different from how we tend to naturally think of it. The best way to start developing an intuition for these things, and to prevent more bottlenecks in foresight rather than tackling them in hindsight with a profiler in hand, is to start profiling a lot. At that point you'll start to gain some intuition about which data structures are cache-friendly and which are not (linked structures lacking a contiguous allocator, e.g.), though a lot of this still inevitably has to be discovered in hindsight. Even the developers at Intel use profilers for their code, as the complexity of the hardware has increased to a point where it's very difficult to predict exactly what it's going to do, as opposed to just realizing it in hindsight with a profiler in hand.

So my suggestion is to try to erase what seems intuitive to a human as being "less work", and to start thinking about it more in terms of memory access and locality of reference. Also, this is a nice starter link:

https://en.wikipedia.org/wiki/Locality_of_reference

Upvotes: 2

You really should benchmark, but what the author perhaps had in mind is the CPU cache. Sometimes (often, but not always) a repeated cache miss (or a wrong branch prediction) will slow down your program.

Typically, when a (super-scalar, pipelined) processor misses the cache (an L3 cache miss, say) to the point of fetching data from your RAM stick, it can lose several hundred cycles (or nanoseconds), and that is enough time to perform several hundred arithmetic operations (on data inside registers or inside the L1 cache). Hence, memoizing a simple computation might not be worthwhile (e.g. because having a larger structure may require more memory bandwidth and cause more cache misses).
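
A rough way to see that latency yourself (a sketch; the exact numbers vary a lot between machines): chase a random single-cycle permutation that is far larger than the caches, so that each load depends on the previous one and pays close to the full memory latency.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const std::size_t n = std::size_t{1} << 24;        // 16M entries, 128 MiB
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});

    // Sattolo's algorithm: shuffle into a single cycle visiting every element,
    // so the chase below cannot get stuck in a short, cache-resident loop.
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i)
        std::swap(next[i], next[rng() % i]);

    std::size_t at = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t step = 0; step < n; ++step)
        at = next[at];                                  // each load depends on the last
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
    std::printf("~%.0f ns per dependent load (ended at %zu)\n", ns, at);
}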

But the devil is in the details: you need to benchmark (and things could be different on different computers, e.g. on your laptop and on my desktop running the same Linux operating system, with the same compiler and the same binary executable), of course with optimizations enabled in your compiler (e.g. g++ -Wall -O2 -march=native with GCC...).

You'll learn much more by reading books on computer architecture.

Read this nice, thought-provoking page on teaching yourself programming in ten years. It also gives a table of approximate timings for various operations on a typical PC (which it borrowed from elsewhere).

Upvotes: 4
