Steffen Binas

Reputation: 1468

What is a safe Maximum Stack Size or How to measure use of stack?

I have an app with a number of worker threads, one for each core. On a modern 8-core machine, I have 8 of these threads. My app loads many plugins, which also have their own worker threads. Because the app uses huge blocks of memory (photos, e.g. 200 MB), I have a memory fragmentation problem (32-bit app). The problem is that every thread reserves {$MAXSTACKSIZE ...} of address space. It's not using physical memory, only address space. I reduced MAXSTACKSIZE from 1 MB to 128 KB, and it seems to work, but I don't know if I'm near the limit. Is there any way to measure how much stack is really used?

Upvotes: 11

Views: 7913

Answers (6)

PhiS

Reputation: 4650

For the sake of completeness, I am adding a version of the CommittedStackSize function provided in opc0de's answer for determining the amount of used stack that will work both for x86 32- and 64-bit versions of Windows (opc0de's function is for Win32 only).

opc0de's function reads the address of the base of the stack and the address of the lowest committed stack page from Windows' Thread Information Block (TIB). There are two differences between x86 and x64:

  • The TIB is pointed to by the FS segment register on Win32, but by GS on Win64
  • The absolute offsets of items in the structure differ (mostly because some items are pointers, i.e. 4 bytes and 8 bytes on Win32/64, respectively)

Additionally, note that there is a small difference in the BASM code: on x64, abs is required to make the assembler use an absolute offset from the segment register.

Therefore, a version that works on both Win32 and Win64 looks like this:

{$IFDEF MSWINDOWS}
function CommittedStackSize: NativeUInt;
//NB: Win32 uses FS, Win64 uses GS as base for Thread Information Block.
asm
 {$IFDEF WIN32}
  mov eax, [fs:04h] // TIB: base of the stack
  mov edx, [fs:08h] // TIB: lowest committed stack page
  sub eax, edx      // compute difference in EAX (=Result)
 {$ENDIF}
 {$IFDEF WIN64}
  mov rax, abs [gs:08h] // TIB: base of the stack
  mov rdx, abs [gs:10h] // TIB: lowest committed stack page
  sub rax, rdx          // compute difference in RAX (=Result)
 {$ENDIF}
end;
{$ENDIF}
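
As a cross-check that needs no assembler, the same numbers can probably be obtained with VirtualQuery. The following is only a sketch under assumptions, not part of either answer: it assumes the committed part of the stack shows up as a single region above the guard page, and the names ReportStack and TouchStack are made up for illustration. Besides the committed size it also prints the total reservation, which gives a rough idea of how close a thread is to the limit.

```pascal
program StackProbeSketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: reports the stack of the thread that calls it.
procedure ReportStack(const Caption: string);
var
  Info: TMemoryBasicInformation;
  Marker: Integer; // any local variable lives on the current stack
  Top, Committed, Reserved: NativeUInt;
begin
  Marker := 0;
  // Ask Windows about the memory region that contains the local variable.
  if VirtualQuery(@Marker, Info, SizeOf(Info)) = 0 then
    RaiseLastOSError;
  // Assumption: this region runs from the lowest committed page to the stack base.
  Top := NativeUInt(Info.BaseAddress) + Info.RegionSize;
  Committed := Info.RegionSize;
  // AllocationBase is the lowest address of the whole stack reservation.
  Reserved := Top - NativeUInt(Info.AllocationBase);
  Writeln(Caption, ': committed ', IntToStr(Int64(Committed)),
    ' of ', IntToStr(Int64(Reserved)), ' reserved bytes');
end;

// Hypothetical helper: uses 64 KB of stack so that more pages get committed.
procedure TouchStack;
var
  Buf: array[0..64 * 1024 - 1] of Byte;
begin
  FillChar(Buf, SizeOf(Buf), 0);
end;

begin
  ReportStack('before');
  TouchStack;
  ReportStack('after');
end.
```

Because Windows does not give committed stack pages back while the thread runs, calling such a routine at the end of a worker thread's Execute method should approximate that thread's peak stack use, rounded up to whole pages.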

Upvotes: 9

TheBlastOne

Reputation: 4320

Years ago, I FillChar'd all available stack space with zeroes upon init and counted the contiguous zeroes upon deinit, starting from the end. This yielded a good 'high water mark', provided you put your app through its paces in the probe runs.

I'll dig out the code when I am back at a desktop.

Update: OK the principle is demonstrated in this (ancient) code:

{***********************************************************
  StackUse - A unit to report stack usage information

  by Richard S. Sadowsky
  version 1.0 7/18/88
  released to the public domain

  Inspired by an idea by Kim Kokkonen.

  This unit, when used in a Turbo Pascal 4.0 program, will
  automatically report information about stack usage.  This is very
  useful during program development.  The following information is
  reported about the stack:

  total stack space
  Unused stack space
  Stack space used by your program

  The unit's initialization code handles three things, it figures out
  the total stack space, it initializes the unused stack space to a
  known value, and it sets up an ExitProc to automatically report the
  stack usage at termination.  The total stack space is calculated by
  adding 4 to the current stack pointer on entry into the unit.  This
  works because on entry into a unit the only thing on the stack is the
  2 word (4 bytes) far return value.  This is obviously version and
  compiler specific.

  The ExitProc StackReport handles the math of calculating the used and
  unused amount of stack space, and displays this information.  Note
  that the original ExitProc (Sav_ExitProc) is restored immediately on
  entry to StackReport.  This is a good idea in ExitProc in case a
  runtime (or I/O) error occurs in your ExitProc!

  I hope you find this unit as useful as I have!

************************************************************}

{$R-,S-} { we don't need no stinkin range or stack checking! }
unit StackUse;

interface

var
  Sav_ExitProc     : Pointer; { to save the previous ExitProc }
  StartSPtr        : Word;    { holds the total stack size    }

implementation

{$F+} { this is an ExitProc so it must be compiled as far }
procedure StackReport;

{ This procedure may take a second or two to execute, especially }
{ if you have a large stack. The time is spent examining the     }
{ stack looking for our init value ($AA). }

var
  I                : Word;

begin
  ExitProc := Sav_ExitProc; { restore original exitProc first }

  I := 0;
  { step through stack from bottom until a byte is no longer $AA }
  while I < SPtr do
    if Mem[SSeg:I] <> $AA then begin
      { found the first overwritten byte, so report the stack usage info }
      WriteLn('total stack space : ',StartSPtr);
      WriteLn('unused stack space: ', I);
      WriteLn('stack space used  : ',StartSPtr - I);
      I := SPtr; { end the loop }
    end
    else
      inc(I); { look in next byte }
end;
{$F-}


begin
  StartSPtr := SPtr + 4; { on entry into a unit, only the FAR return }
                         { address has been pushed on the stack.     }
                         { therefore adding 4 to SP gives us the     }
                         { total stack size. }
  FillChar(Mem[SSeg:0], SPtr - 20, $AA); { init the stack   }
  Sav_ExitProc := ExitProc;              { save exitproc    }
  ExitProc     := @StackReport;          { set our exitproc }
end.

(From http://webtweakers.com/swag/MEMORY/0018.PAS.html)

I faintly remember having worked with Kim Kokkonen at that time, and I think the original code is from him.

The good thing about this approach is that there is zero performance penalty and no profiling during the program run. Only at shutdown does the loop-until-changed-value-found code eat up CPU cycles. (We coded that one in assembly later.)

Upvotes: 3

opc0de

Reputation: 11767

Use this to compute the amount of memory committed for the current thread's stack:

function CommittedStackSize: Cardinal;
asm
  mov eax,[fs:$4] // base of the stack, from the Thread Environment Block (TEB)
  mov edx,[fs:$8] // address of lowest committed stack page
                  // this gets lower as you use more stack
  sub eax,edx
end;

I don't have any other ideas.

Upvotes: 12

André

Reputation: 9112

Reducing $MAXSTACKSIZE won't work because Windows will always align the thread stack to 1 MB (?).

One (possible?) way to prevent fragmentation is to reserve (not commit!) virtual memory with VirtualAlloc before creating the threads, and to release it once the threads are running. This way Windows cannot place the thread stacks in the reserved range, so you keep some contiguous address space.
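
A minimal sketch of that idea, untested against the asker's app: PlaceholderSize is an assumed value, and the comment marks where the real thread creation would go. Note that releasing the range only makes it available again; nothing guarantees that a later large allocation actually lands there.

```pascal
program ReserveBeforeThreadsSketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  PlaceholderSize = 256 * 1024 * 1024; // assumed size of the range to protect

var
  Placeholder: Pointer;
begin
  // MEM_RESERVE claims address space only; no physical memory is committed.
  Placeholder := VirtualAlloc(nil, PlaceholderSize, MEM_RESERVE, PAGE_NOACCESS);
  if Placeholder = nil then
    RaiseLastOSError;
  try
    // Create the worker and plugin threads here. Their stacks cannot be
    // placed inside the reserved range.
    Writeln('Reserved placeholder at ', IntToHex(Int64(NativeUInt(Placeholder)), 8));
  finally
    // MEM_RELEASE requires a size of 0 and frees the whole reservation.
    if not VirtualFree(Placeholder, 0, MEM_RELEASE) then
      RaiseLastOSError;
  end;
end.
```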

Or you could write your own memory manager for large photos: reserve a lot of virtual memory and allocate memory from this pool by hand (you need to maintain a list of used and unused memory yourself).
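
The reserve-then-commit mechanics behind such a pool could look roughly like this. It is a sketch only: PoolSize and SliceSize are assumed values, and the bookkeeping of used and unused slices, which is the hard part, is left out.

```pascal
program PhotoPoolSketch;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  PoolSize  = 256 * 1024 * 1024; // assumed: address space set aside for photos
  SliceSize = 64 * 1024 * 1024;  // assumed: memory needed for one photo

var
  Pool, Slice: Pointer;
begin
  // Reserve one contiguous range up front, before fragmentation sets in.
  Pool := VirtualAlloc(nil, PoolSize, MEM_RESERVE, PAGE_NOACCESS);
  if Pool = nil then
    RaiseLastOSError;
  try
    // Commit a slice at the start of the pool when a photo is loaded.
    Slice := VirtualAlloc(Pool, SliceSize, MEM_COMMIT, PAGE_READWRITE);
    if Slice = nil then
      RaiseLastOSError;
    FillChar(Slice^, SliceSize, 0); // stands in for working with the photo
    // Decommit returns the physical memory but keeps the address range.
    if not VirtualFree(Slice, SliceSize, MEM_DECOMMIT) then
      RaiseLastOSError;
  finally
    VirtualFree(Pool, 0, MEM_RELEASE);
  end;
end.
```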

At least, that's a theory, don't know if it really works...

Upvotes: 0

David Heffernan

Reputation: 612784

Whilst I am sure that you can reduce the thread stack size in your app, I don't think it will address the root cause of the problem. You are using an 8-core machine now, but what happens on a 16-core or a 32-core machine?

With 32 bit Delphi you have a maximum address space of 4GB and so this does limit you to some degree. You may well need to use smaller stacks for some or all of your threads, but you will still face problems on a big enough machine.

To help your app scale better to larger machines, you may need to take one or both of the following steps:

  1. Avoid creating significantly more threads than cores. Use a thread pool architecture that is available to your plug-ins. Without the benefit of the .NET environment to make this easy, you will be best off coding against the Windows thread pool API. That said, there must be a good Delphi wrapper available.
  2. Deal with the memory allocation patterns. If your threads are allocating contiguous blocks in the region of 200MB then this is going to cause undue stress on your allocator. I have found that it is often best to allocate such large amounts of memory in smaller, fixed size blocks. This approach works around the fragmentation problems you are encountering.
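
The second step might be sketched as follows. TChunkedBuffer, BytePtr and the 1 MB ChunkSize are hypothetical names and values chosen for illustration, and the sketch assumes a Delphi version in which PByte supports pointer arithmetic via Inc.

```pascal
program ChunkedBufferSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  ChunkSize = 1024 * 1024; // 1 MB per block; an assumed, tunable value

type
  // Hypothetical helper: one logical buffer stored as many small blocks.
  TChunkedBuffer = class
  private
    FChunks: array of PByte;
    FSize: Int64;
  public
    constructor Create(ASize: Int64);
    destructor Destroy; override;
    function BytePtr(Offset: Int64): PByte;
    property Size: Int64 read FSize;
  end;

constructor TChunkedBuffer.Create(ASize: Int64);
var
  I: Integer;
begin
  inherited Create;
  FSize := ASize;
  // Round up so that a partial last block is still covered.
  SetLength(FChunks, Integer((ASize + ChunkSize - 1) div ChunkSize));
  for I := 0 to High(FChunks) do
    FChunks[I] := AllocMem(ChunkSize); // zero-filled block
end;

destructor TChunkedBuffer.Destroy;
var
  I: Integer;
begin
  for I := 0 to High(FChunks) do
    if FChunks[I] <> nil then
      FreeMem(FChunks[I]);
  inherited Destroy;
end;

function TChunkedBuffer.BytePtr(Offset: Int64): PByte;
begin
  if (Offset < 0) or (Offset >= FSize) then
    raise ERangeError.Create('Offset outside buffer');
  // Pick the block, then step to the byte inside it.
  Result := FChunks[Integer(Offset div ChunkSize)];
  Inc(Result, Integer(Offset mod ChunkSize));
end;

var
  Photo: TChunkedBuffer;
begin
  Photo := TChunkedBuffer.Create(200 * 1024 * 1024); // a 200 MB "photo"
  try
    Photo.BytePtr(Photo.Size - 1)^ := 255;
    Writeln('Last byte: ', Photo.BytePtr(Photo.Size - 1)^);
  finally
    Photo.Free;
  end;
end.
```

The trade-off is that image code can no longer treat the photo as one flat array. Picking a chunk size that holds a whole number of scan lines would keep per-row processing simple.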

Upvotes: 1

Martin James

Reputation: 24847

Even if all 8 threads were to come close to using their 1MB of stack, that's only 8MB of virtual memory. IIRC, the default initial stack size for threads is 64K, increasing upon page-faults unless the process thread-stack limit is reached, at which point I assume your process will be stopped with a 'Stack overflow' messageBox :((

I fear that reducing the process stack limit $MAXSTACKSIZE will not alleviate your fragmentation/paging issue much, if at all. You need more RAM so that the resident page set of your mega-photo-app is bigger and thrashing is reduced.

How many threads are there, overall, on average, in your process? Task manager can show this.

Rgds, Martin

Upvotes: 1
