c++segmentation-faultundefined-behaviorsimdintrinsics

Reputation: 1856

SIMD load across memory boundary doesn't cause segfault?

Suppose I do a (unaligned) packed load _mm256_loadu_pd on a double (see code snippet below). Does this violate the strict aliasing rule or otherwise result in undefined behavior as per C++ standard? Shouldn't it trigger a segmentation fault in theory?

If not, can this behavior be relied upon? (This is useful when, say, loading 3 doubles in one go.)

The following code compiles without warnings (gcc-14.2.1 g++ -mavx -pedantic -Wall) and runs fine (on GNU/Linux 6.13.2-arch1-1):

#include <immintrin.h>
#include <cstdio>

int main() {
    double* ptr = new double {};
    double buf[4];
    _mm256_storeu_pd(buf, _mm256_loadu_pd(ptr));
    delete ptr;
    std::printf("%e\n", buf[3]);
}

No segfaults whatsoever. ASan (-fsanitize=address) does report heap buffer overflow though.

Edit: Following Wenzel's link, I think this is not UB (because it's not the C++ standard that defines the SIMD types), but the behavior may be implementation dependent. My question is more like why this load is allowed (by kernel? by hardware?) at all? Doesn't this give me access to memory that I do not own?

Upvotes: 2

Answers (4)

Peter Cordes

Reputation: 365257

As @gnasher729 says, UB doesn't mean a fault is required. Just the opposite: the compiler can assume it doesn't happen, because whatever does happen at any point earlier or later in the program (including continuing to run normally) is allowed by the ISO C++ standard.

Hardware memory protection happens with page granularity. As far as the kernel is concerned, your process owns the whole 4K or 2M page containing the double. The finer-grained bookkeeping for new/delete is only done in user-space.

A load that includes any valid bytes will only fault if it crosses into the next page and that next page is unmapped. This can't happen for an aligned load that accesses any valid bytes, since the page size is (much) larger than the vector width. This enables SIMD vectorization even for algorithms like strlen where the last valid byte isn't known until we actually read it. (With some address math for the start of the loop, then using aligned loads in the loop.)
See Is it safe to read past the end of a buffer within the same page on x86 and x64?

new could return a pointer to a double in the last 8 or 16 bytes of a page, but happens not to for the first allocation in a fresh program. (The memory will be aligned by at least alignof(double), which is only 4 on i386 Linux. In practice libstdc++ / glibc will return memory sufficiently aligned for max_align_t, which is 16 on i386 and x86-64 GNU/Linux. But only 8 on 32-bit x86 Windows.)

(There is some research into memory-safe ISAs: Why can't we have a safe ISA? - for example the CHERI project based on RISC-V. Until/unless we're compiling for and running on something like that, we can't expect hardware to trap accesses outside object bounds within a page. On current hardware, we can only get that with extra software checking that comes at great performance cost, like valgrind or -fsanitize=address)

Your store is to a 32-byte array (double buf[4];) so you're not corrupting anything with it. Storing past the end of an object is very bad even if you don't fault: you could corrupt the allocator's bookkeeping data, or the payload of another allocation.

Even load / blend / store (putting back the same bytes you loaded outside the bytes you own) isn't thread-safe: you could have reverted a modification by another thread. For this reason, compilers must not invent stores to objects the abstract machine hasn't at least read; for things you have read, it would be data-race UB if another thread was writing it. But compilers often don't invent stores even then. So write arr[i] = cond ? x : arr[i]; to allow auto-vectorization with a SIMD blend, instead of if(cond) arr[i] = x; for cases where no other thread is accessing this region of the array so it is safe to load/blend/store. (Fun fact: SVE, AVX-512, and AVX2 for float/double, have masked stores that allow vectorization of conditional stores, but the AVX2 ones are not fast on AMD even with Zen 4, even though it handles AVX-512 masked stores efficiently. https://uops.info/ - vmaskmovpd stores and vmovapd mem{k}, ymm)

I think this is not UB (because it's not the C++ standard that defines the SIMD types), but the behavior may be implementation dependent.

That reasoning is faulty. Language extensions can define some but not all ways of using them. You are reading past the end of a new double{} allocation, which is UB.

It's even visible at compile-time. But in practice compilers currently just generate asm instructions that load from the address you ask it to load from. Especially when you don't enable optimization.

The alternative would be for the compiler to assume this path of execution is unreachable since compilers can in general assume that programs don't encounter undefined behaviour. If they actually do, the standard doesn't require any specific behaviour for any of your program before or after; literally anything is allowed to happen, including continuing to run like nothing happened. UB is the opposite of "exception required" / "must trap"; it allows optimizers to make assumptions. Sanitizers like -fsanitize=undefined or -fsanitize=address do change that, though, making some kinds of UB an error that gets reported.

following Wenzel's link,

It's not a duplicate of *Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? there's no raw deref of __m256d*, only opaque loadu and storeu functions which do unaligned aliasing-safe loads and stores. They're equivalent to memcpy in terms of correctness and which bytes are accessed.

Upvotes: 4

robthebloke

Reputation: 9678

It works fine, until the day it doesn't! Typically on Win32, I've never seen a problem with reading outside of a valid memory range (YMMV, I stopped doing this at least 12 years ago!). On linux however, not so much (so no, don't rely on the behaviour!)

On Linux/Mac/iOS you'll typically have problems when you're exceeding the memory range of mmap (e.g. reading from outside a mapped pointer to vertex buffer data in OpenGL/Vulkan). You might have permission to read from the first value, but not the latter values (so... Bang! Seg fault).

In the case of using new, you may get away with it (until you won't). Typically, new/delete will operate on a larger page of memory, from which the allocations are returned (typically 16byte aligned). Every so often though, you may hit the end of that page, which isn't great!

Certainly, you should not rely on the values being loaded as anything sensible. So yeah, using the value of buf[3] would be UB (it might be zero in debug builds, random garbage in release builds)

In this case there really isn't a need though, given that you can load a single value with the _mm256_load_sd intrinsic:

    _mm256_storeu_pd(buf, _mm256_load_sd(ptr));

If you are loading 2x doubles, use _mm_loadu_pd, followed with a _mm256_castpd128_pd256.

If you are loading 3x doubles, use _mm256_maskload_pd.

Any other way, and you'll hit problems with UB eventually (often with a seg fault at seemingly random times)

Upvotes: 0

Alan Birtles

Reputation: 36488

The operating system gives your program pages of memory to work with, these are often 4kb in size. You can read and write wherever you like within that page and the OS won't notice or care.

You'll almost certainly corrupt something if you write outside your buffers but you'll usually get away with reading outside the bounds of a buffer. The issue is that you don't know where inside the page a particular buffer is going to be placed, if it happens to be placed near the end of the page reading off the end of it might indeed cause a segfault.

Any reading outside the bounds of a variable or buffer is undefined behaviour and should be avoided, even if it appears to work most of the time, you'll be setting yourself up for a hard to debug random crash at some point in the future.

Upvotes: 0

gnasher729

Reputation: 52602

Undefined behaviour doesn’t mean ”segfault”. It means “anything can happen”, and if the Intel instruction set manual says “performs an unaligned access” it will perform an unaligned access.

Upvotes: 0

SIMD load across memory boundary doesn&#39;t cause segfault?

Answers (4)

Related Questions

SIMD load across memory boundary doesn't cause segfault?