Reputation: 247
I'm dealing with a gargantuan C++ computational physics code (that I didn't write) which launches other executables through system() calls. Sometimes, in the middle of a simulation, these system() calls fail even for something as trivial as system("echo something"). When they fail, they return immediately with a return value of -1. I also created a version of the code that uses popen() instead of system() to launch these other executables. In that version, popen() fails and errno is set to 12 (ENOMEM).
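For reference, the error check I added around these launches boils down to something like the following minimal sketch (run_command is an illustrative name, not the real code, which launches the physics executables rather than echo):

#include <errno.h>
#include <stdio.h>
#include <string.h>

// Illustrative wrapper (names are mine, not from the real code): launch a
// command with popen() and report errno when the launch itself fails.
int run_command(const char *cmd)
{
    errno = 0;
    FILE *pipe = popen(cmd, "r");
    if (pipe == NULL) {
        // In the failing runs this reports errno 12: "Cannot allocate memory"
        fprintf(stderr, "popen(\"%s\") failed: %s (errno %d)\n",
                cmd, strerror(errno), errno);
        return -1;
    }
    char buf[256];
    while (fgets(buf, sizeof buf, pipe) != NULL)
        fputs(buf, stdout);    // forward the child's output
    return pclose(pipe);       // exit status of the child command
}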
This is running on a machine with 96GB of RAM running CentOS 6.3 (via ROCKS 6.1) through the Torque PBS system.
Note that this behavior is somewhat uncommon, but it seems to occur in simulations that use large amounts of memory -- though still far less than the memory that is available.
I currently have a simulation running that is exhibiting this behavior. It attempts a system() call every 30 seconds and fails each time (sketched below, after the meminfo output), which lets me monitor the OS memory state while the problem is happening. The contents of /proc/meminfo are:
MemTotal: 99195180 kB
MemFree: 1758804 kB
Buffers: 14612 kB
Cached: 46502432 kB
SwapCached: 7004 kB
Active: 60758772 kB
Inactive: 35238760 kB
Active(anon): 45458924 kB
Inactive(anon): 4024068 kB
Active(file): 15299848 kB
Inactive(file): 31214692 kB
Unevictable: 9752 kB
Mlocked: 9752 kB
SwapTotal: 1023992 kB
SwapFree: 999432 kB
Dirty: 16 kB
Writeback: 8 kB
AnonPages: 49483620 kB
Mapped: 10292 kB
Shmem: 8 kB
Slab: 235356 kB
SReclaimable: 193468 kB
SUnreclaim: 41888 kB
KernelStack: 2120 kB
PageTables: 99536 kB
NFS_Unstable: 4 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 50621580 kB
Committed_AS: 49576180 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 482936 kB
VmallocChunk: 34307833876 kB
HardwareCorrupted: 8 kB
AnonHugePages: 43315200 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 5568 kB
DirectMap2M: 2082816 kB
DirectMap1G: 98566144 kB
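The retry mentioned above amounts to roughly the following (a simplified sketch, not the actual source; the real code launches one of the physics executables rather than echo):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Simplified sketch of the retry loop (not the actual source): try a trivial
// system() call every 30 seconds and log the failure, which is what lets me
// watch /proc/meminfo while the problem is occurring.
void retry_trivial_system_call(void)
{
    for (;;) {
        errno = 0;
        int rc = system("echo something");
        if (rc != -1)
            break;    // the launch finally succeeded
        fprintf(stderr, "system() returned -1: %s (errno %d)\n",
                strerror(errno), errno);
        sleep(30);    // wait 30 seconds and try again
    }
}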
The contents of /proc/5939/status (the process in question) are:
Name: BAD_EXECUTABLE
State: S (sleeping)
Tgid: 5939
Pid: 5939
PPid: 5938
TracerPid: 0
Uid: 505 505 505 505
Gid: 505 505 505 505
Utrace: 0
FDSize: 256
Groups: 426 505 801
VmPeak: 49733876 kB
VmSize: 49482532 kB
VmLck: 0 kB
VmHWM: 49721496 kB
VmRSS: 49470248 kB
VmData: 49481080 kB
VmStk: 128 kB
VmExe: 1316 kB
VmLib: 0 kB
VmPTE: 96656 kB
VmSwap: 10624 kB
Threads: 1
SigQ: 0/774828
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: ffffff
Cpus_allowed_list: 0-23
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list: 0-1
voluntary_ctxt_switches: 57003
nonvoluntary_ctxt_switches: 1057385
I'm at a bit of a loss as to how to debug this issue, especially since I can't reproduce it with a smaller simulation. My simulation reports that it is using 47 GB of memory, yet /proc/meminfo shows less than 2 GB of the 96 GB free, and there shouldn't be anything else running that uses tens of GB of memory.
This forum seems to indicate that earlier memory errors could have corrupted the heap. Is that a plausible explanation? What else could I look at to help narrow down this issue?
Upvotes: 2
Views: 768