user1042840

Reputation: 1945

What are the benefits of allocating a page-aligned memory chunk?

I realize that most CPUs are better at reading data from an aligned memory address, that is, an address that is a multiple of the CPU word size. However, in many places I read about allocating page-aligned memory. Why might someone want a page-aligned memory address? Is it only for even better performance?

Upvotes: 6

Views: 10322

Answers (3)

Luis Colorado

Reputation: 12668

Alignment always has some performance implications. When you write(2) or read(2) a file, it's best to align the limits of each transfer to block boundaries, because an unaligned transfer can make the kernel do two block reads instead of one. The worst case is reading just two bytes across a block boundary. Suppose you have a block size of 1024 bytes; this code:

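#include <fcntl.h>
#include <unistd.h>
int main(void)
{
    char var[2];
    int fd = open("/etc/passwd", O_RDONLY);
    if (fd < 0) return 1;
    lseek(fd, 1023, SEEK_SET);    /* offset 1023 is the last byte of block 0 */
    read(fd, var, sizeof var);    /* bytes 1023-1024: touches blocks 0 and 1 */
    close(fd);
    return 0;
}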

will force the kernel to do two block reads (at most, since the blocks may already be cached) for a read(2) call of only two bytes.

In the case of memory, all of this is normally managed by malloc(3), and since a page fault is not an error, page alignment costs you nothing in ordinary code (that's why you rarely need the aligned-allocation functions such as posix_memalign(3) or C11 aligned_alloc(), even on demand-paged virtual memory systems). As you consume memory, the kernel allocates it in pages for you. The processor's virtual memory system makes page alignment almost transparent. The only penalty comes from an access that straddles two pages: suppose a misaligned 32-bit integer spans two pages and both have been swapped out by the kernel; you then have to wait for the kernel to swap in two pages of memory instead of one. That is very unlikely, though: the compiler normally aligns data and inner loops so that they rarely cross such boundaries, and the caches help absorb these effects as well.

That said, there are some places where you do get performance improvements if you align memory. I'll try to show you such a scenario:

Suppose you need to dynamically manage a lot of small structures (let's say 16 bytes long) and you plan to manage them with malloc(3). malloc(3) stores a header in each chunk of memory it allocates (let's say this header is 8 bytes long), which makes the memory overhead 50% more than the ideal. If you instead get memory in chunks of (let's say) 64 structures, you pay for just one of those headers (8 bytes) per 64*16 = 1024 bytes, an overhead of 8/1024, or roughly 0.8%.

To manage this, you need to know which chunk each structure belongs to (so you can free(3) the chunk when it is no longer in use), and you can do this in two ways: 1. storing a pointer to the chunk in each structure (this is pointless, as adding 4 bytes to each 16-byte structure loses 25% of the memory again), or 2. forcing the chunk to be aligned, so the chunk address can be easily calculated from the struct address (you only need to subtract from the struct address its remainder modulo the chunk size). This last method imposes no overhead to locate the chunk, but requires all chunks to be chunk aligned (not page aligned).
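A minimal sketch of the second approach, under some assumptions of mine: the chunk size is a power of two (so the "remainder mod chunk size" becomes a bit mask), the chunk comes from C11 aligned_alloc() rather than a hand-rolled allocator, and the names CHUNK_SIZE, struct item, chunk_alloc and chunk_of are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE 1024u   /* assumed chunk size: 64 structures of 16 bytes */

struct item { char payload[16]; };   /* hypothetical 16-byte structure */

/* Allocate one chunk whose address is a multiple of CHUNK_SIZE.
   Returns NULL on failure; release it with free(). */
void *chunk_alloc(void)
{
    /* The mask trick below only works for a power-of-two chunk size. */
    assert((CHUNK_SIZE & (CHUNK_SIZE - 1)) == 0);
    return aligned_alloc(CHUNK_SIZE, CHUNK_SIZE);
}

/* Recover the owning chunk from any pointer into it by clearing
   the low bits, i.e. subtracting (address mod CHUNK_SIZE). */
void *chunk_of(const void *ptr)
{
    return (void *)((uintptr_t)ptr & ~(uintptr_t)(CHUNK_SIZE - 1));
}
```

With this, chunk_of(&items[i]) gives back the pointer returned by chunk_alloc() for every one of the 64 structures in the chunk, without storing anything per structure.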

This way you improve performance considerably, as you greatly reduce the number of malloc(3) calls and the memory wasted by allocating small amounts of memory.

By the way, malloc doesn't ask the operating system for memory on each call. It obtains memory in chunks, in a manner similar to the one explained here, and many implementations rarely return freed memory to the system (they reuse it before allocating more). malloc itself manages the calls to the sbrk(2) system call, which means you are going to interfere with malloc if you use this system call yourself.

Linux/Unix will give you page-aligned memory when you use the shmat(2) system call. Try reading this and related documents.

Upvotes: 1

Andrew Henle

Reputation: 1

Alignment restrictions are usually associated with direct IO - which bypasses the page cache, copying data to/from disk directly into or from the address space of a process. This can provide significant performance improvements in cases where the page cache is not needed - such as streaming multiple gigabytes of data, especially when doing IO to/from extremely fast disk systems.

Note that only some file systems support direct IO.

On Linux, Red Hat's documentation says, in part:

Direct I/O best practices


Users must always take care to use properly aligned and sized IO. This is especially important for Direct I/O access. Direct I/O should be aligned on a 'logical_block_size' boundary and in multiples of the 'logical_block_size'. With native 4K devices (logical_block_size is 4K) it is now critical that applications perform Direct I/O that is a multiple of the device's 'logical_block_size'. This means that applications that do not perform 4K aligned I/O, but 512-byte aligned I/O, will break with native 4K devices. Applications may consult a device's "I/O Limits" to ensure they are using properly aligned and sized I/O. The "I/O Limits" are exposed through both sysfs and block device ioctl interfaces (also see: libblkid).

sysfs interface

/sys/block//alignment_offset

/sys/block///alignment_offset

/sys/block//queue/physical_block_size

/sys/block//queue/logical_block_size

/sys/block//queue/minimum_io_size

/sys/block//queue/optimal_io_size

Note that the use of direct IO can be limited by the actual hardware as well as by software. As noted in the Red Hat documentation, physical device limitations matter.

To use direct IO, on Linux the file needs to be opened with the O_DIRECT flag:

int fd = open( filename, O_RDONLY | O_DIRECT );
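Since the buffer address, the file offset and the transfer size all have to respect the block size, here is a rough sketch of how the read side could look. The helper names dio_aligned and dio_read are hypothetical, the block parameter is assumed to be the device's logical_block_size (a power of two such as 512 or 4096), and fd is assumed to have been opened with O_DIRECT as above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Nonzero if the buffer address, file offset and length are all
   multiples of `block` (assumed: the logical block size). */
int dio_aligned(const void *buf, off_t offset, size_t len, size_t block)
{
    assert(block != 0);
    return (uintptr_t)buf % block == 0
        && (uintmax_t)offset % block == 0
        && len % block == 0;
}

/* Read `len` bytes at `offset` into a freshly allocated, block-aligned
   buffer. Returns the buffer (caller must free() it), or NULL if the
   request is misaligned, allocation fails, or the read comes up short. */
void *dio_read(int fd, off_t offset, size_t len, size_t block)
{
    void *buf = NULL;
    if (posix_memalign(&buf, block, len) != 0)
        return NULL;
    if (!dio_aligned(buf, offset, len, block)
        || pread(fd, buf, len, offset) != (ssize_t)len) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

A call such as dio_read(fd, 0, 4096, 4096) would then fetch the first 4 KiB of the file; a request like dio_read(fd, 100, 4096, 4096) is rejected before it reaches the kernel because the offset is not block aligned.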

In my experience, direct IO can result in 20-30% gains in IO performance under certain circumstances. Those circumstances usually involve streaming large amounts of data to/from a file on a very fast file system with the application performing no or very few seek() calls.

Upvotes: 1

user2371524

Reputation:

The "traditional" way to allocate memory is to have it in a contiguous address space (the "heap", growing upwards by calls to sbrk()). Each time you cross a page boundary, there will be a page fault and a new page gets mapped for you. There are two consequences of this strategy:

  1. pages can only be freed when all allocations inside that page are freed AND when all other allocations are mapped to lower addresses. (the typical effect of heap fragmentation).
  2. larger allocations might occupy one page more than strictly needed (if they start somewhere in the middle of a page).

So this strategy is only suitable for smaller blocks of memory where you don't want to "waste" a whole page for each allocation.
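To make the second point concrete, here is a small sketch that counts how many pages an allocation touches depending on where it starts. The function name pages_spanned is made up, and the page size is passed in as a parameter rather than queried from the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by `len` bytes starting at address `start`,
   for a page size of `page` bytes. */
size_t pages_spanned(uintptr_t start, size_t len, size_t page)
{
    assert(page != 0 && len != 0);
    uintptr_t first = start / page;               /* page holding the first byte */
    uintptr_t last  = (start + len - 1) / page;   /* page holding the last byte  */
    return (size_t)(last - first + 1);
}
```

With 4096-byte pages, an 8192-byte allocation starting on a page boundary occupies 2 pages, while the same allocation starting 2048 bytes into a page occupies 3.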

For bigger chunks, it's better to use mmap(), which maps new pages for you directly, so you get "page aligned memory". This way, your allocation doesn't share pages with other allocations. As soon as you don't need the memory any more, you can give it back to the OS. Note that many malloc() implementations choose automatically whether to allocate using sbrk() or mmap(), depending on the size of the desired allocation.
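As an illustration only, an mmap()-based allocation could look roughly like this. It assumes a Linux/glibc-style system where MAP_ANONYMOUS is available, and the names round_to_pages, big_alloc and big_free are invented for the sketch; a real malloc() does considerably more bookkeeping.

```c
#define _DEFAULT_SOURCE   /* assumption: glibc, needed for MAP_ANONYMOUS */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round `len` up to a whole number of pages. */
size_t round_to_pages(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}

/* Map fresh, zero-filled, page-aligned memory. Returns NULL on failure. */
void *big_alloc(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, round_to_pages(len), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hand the pages straight back to the OS. Returns 0 on success. */
int big_free(void *p, size_t len)
{
    return munmap(p, round_to_pages(len));
}
```

The pointer returned by big_alloc() is always a multiple of the page size, and big_free() releases the pages immediately, independent of any other allocation.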

Upvotes: 10
