zephyr0110

Reputation: 225

Dirty page accounting in Linux kernel through /proc/$PID/smaps

TL;DR: how exactly is the kernel able to do dirty page accounting in /proc/$PID/smaps?

Consider the following program statement in C:

static char page1[PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));

Uninitialized static variables start out as zero. My understanding is that at program start the kernel maps them to the zero page and lazily allocates real pages via copy-on-write. Fine, that makes sense, and it lets the kernel account for dirty pages in the uninitialized sections when the page fault occurs.

Now consider the statement:

static char page1[PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE))) = {'c'};

Here, the loader will load the values for page1 at initialization of the program, and mark the page as RW. So any write done by the program should be invisible to the kernel, as no page fault is triggered.

Here is the program I wrote for experimentation:

#include <stdio.h>   /* scanf */
#include <stdlib.h>  /* malloc */

#define PAGE_SIZE (4*1024)
static char page1[PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE))) = {'c'};
int main(void)
{
  char c; int i; int *d;
  scanf("%c", &c);                // --------- tag 1
  for(i = 0; i < PAGE_SIZE; i++)
    {
      page1[i] = c;               // --------- tag 2
    }
  d = malloc(sizeof(int));
  while(1);
  return 0;
}

Now before tag 1 and after tag 2 (comments in code), the output of /proc/$PID/smaps for the section containing page1 is pasted below in the table:

smap             BEFORE TAG-1   AFTER TAG-2
Size:            8 kB           8 kB
KernelPageSize:  4 kB           4 kB
MMUPageSize:     4 kB           4 kB
Rss:             8 kB           8 kB
Pss:             8 kB           8 kB
Shared_Clean:    0 kB           0 kB
Shared_Dirty:    0 kB           0 kB
Private_Clean:   4 kB           0 kB
Private_Dirty:   4 kB           8 kB
Referenced:      8 kB           8 kB
Anonymous:       4 kB           8 kB

As you can see, Private_Clean, Private_Dirty and Anonymous changed.
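For anyone who wants to collect the numbers in the table above without reading smaps by hand, here is a minimal sketch of a parser. The function names smaps_field_kb and self_smaps_field_kb are made up for illustration, and the code assumes the usual smaps layout: a "start-end perms ..." header line per mapping, followed by "Field: N kB" lines.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the value in kB of `field` (e.g. "Private_Dirty") for the mapping
 * that contains `addr`, reading smaps-formatted text from `in`.
 * Returns -1 if the mapping or the field is not found. */
long smaps_field_kb(FILE *in, uintptr_t addr, const char *field)
{
    char line[512];
    int in_target = 0;               /* 1 while inside the mapping holding addr */
    size_t flen;

    assert(in != NULL && field != NULL);
    flen = strlen(field);

    while (fgets(line, sizeof line, in)) {
        uintptr_t start, end;

        /* A mapping header starts with "start-end" in hexadecimal. */
        if (sscanf(line, "%" SCNxPTR "-%" SCNxPTR, &start, &end) == 2) {
            if (in_target)
                break;               /* left the target mapping: field missing */
            in_target = (addr >= start && addr < end);
            continue;
        }

        /* Field lines look like "Private_Dirty:         4 kB". */
        if (in_target && strncmp(line, field, flen) == 0 && line[flen] == ':') {
            long kb;
            if (sscanf(line + flen + 1, "%ld", &kb) == 1)
                return kb;
        }
    }
    return -1;
}

/* Convenience wrapper (hypothetical): query the calling process itself. */
long self_smaps_field_kb(const void *addr, const char *field)
{
    long kb;
    FILE *f = fopen("/proc/self/smaps", "r");
    if (f == NULL)
        return -1;
    kb = smaps_field_kb(f, (uintptr_t)addr, field);
    fclose(f);
    return kb;
}
```

Calling self_smaps_field_kb(page1, "Private_Dirty") before tag 1 and again after tag 2 should reproduce the Private_Dirty row, though the exact values depend on how the toolchain lays out the data segment.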

Questions:

  1. How on earth did the kernel get to know I wrote the page?
  2. What is this Anonymous field and why did it change?

Any other page/blog/manual explaining how all of this works in detail would be helpful.

My guess is that maybe the kernel marks the page as RO, even though it is RW so that page fault triggers and it can do the accounting. Or maybe there is some other process that continuously walks the page tables of processes, but that just seems too expensive.

Upvotes: 3

Views: 418

Answers (1)

Marco Bonelli

Reputation: 69276

Now consider the statement

static char page1[PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE))) = {'c'};

Here, the loader will load the values for page1 at init of the program, and mark the page as RW.

You seem to believe that the loader does a write to memory for this statement, but it does not.

What happens in this case is not mmap RW + write of the byte 'c'. That byte is already embedded in your executable at compile time, so the only thing that happens is a mmap RW, nothing more. Something like this:

mmap(NULL, length_of_data_section, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd_of_your_elf, offset_of_data_section);

Or, most likely, just mmap(...entire file...) followed by a series of mprotect() with the right permissions for the different sections of the ELF.

Actually, it is not even the loader that does this, but rather it is the kernel itself that maps the executable into memory in this case, assuming you are launching your program as ./exe. The loader only maps the program itself when it is invoked as /path/to/loader ./exe. See also this other answer of mine, where I give a slightly more detailed explanation.
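To make the "mmap RW, nothing more" point concrete, here is a small sketch that maps an ordinary file the same way (readable, writable, private) instead of an ELF data section. The helper names map_private_rw and first_byte_of_file are hypothetical, and this only illustrates the mapping flags, not what the kernel's ELF loading code actually does.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the first `len` bytes of `path` readable, writable and private,
 * like an initialized data section. Returns NULL on failure. */
char *map_private_rw(const char *path, size_t len)
{
    void *p;
    int fd;

    assert(path != NULL && len > 0);
    fd = open(path, O_RDONLY);       /* a read-only fd is enough for MAP_PRIVATE */
    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close() */
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Read the first byte through the file API, bypassing the mapping.
 * Returns -1 on failure. */
int first_byte_of_file(const char *path)
{
    unsigned char b;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    n = read(fd, &b, 1);
    close(fd);
    return n == 1 ? (int)b : -1;
}
```

If the file starts with 'c', the mapping shows 'c' at offset 0 without anybody having written to memory, and a later store through the mapping changes only the process's private copy: first_byte_of_file() keeps returning 'c'.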


How on earth did the kernel get to know I wrote the page?

As you probably already know, when your program is initially mapped in memory (including the page containing page1), even though the mapping for that page is RW, there is no real need for the kernel to actually allocate memory for the page until a read or write occurs. This technique is known as demand paging. Initially (right after it is mapped) the page is not even present in the page table of your process: it only exists as one of the many vm_area_struct entries in the memory map of your task.

When a page fault occurs (caused by a read or a write) the kernel then decides what to do based on the nature of the mapping. In this case the mapping is file-backed (the actual initial value for the whole page1 array was written in your ELF file at compile time), so the two possible scenarios are as follows:

  1. When a memory read happens, a page fault happens and the page content is read from the file into memory. The newly allocated page is now marked as read only, even though it was mapped as RW (the kernel still knows that this VMA is RW).

  2. When a memory write happens, there are two cases: either (A) the page was already present in memory because of a previous read (and is marked RO), or (B) the page wasn't in memory at all because this is the first memory access to it. In both cases, a page fault happens, the kernel checks if writing is allowed (yes it is), and copy-on-write takes place.

    Since the page was file-backed, but not shared (i.e. not mapped with MAP_SHARED), the data does not need to be written back to the file, so the kernel simply allocates a new anonymous page and either copies the content over from the previous page (case A) or reads the page from the file into memory (case B) before applying the write. This is why you see Anonymous grow by 4 kB (from 4 kB to 8 kB in your table).

    Additionally, since the old page only had one user before copy-on-write, it can be deallocated, and this is why you see Rss stay the same. Finally, before the write the page was clean (not dirty), and after the write it becomes dirty (not clean), so this is why you see Private_Clean and Private_Dirty change values.

Upvotes: 2
