Reputation: 11
My server triggered the OOM killer and I am trying to understand why. The system has plenty of RAM (128 GB), and it looks like only around 70 GB of it was actually in use. Reading through previous questions about OOM, this might be a case of memory fragmentation. See the syslog output:
Jun 23 17:20:10 server1 kernel: [517262.504589] gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Jun 23 17:20:10 server1 kernel: [517262.504593] gmond cpuset=/ mems_allowed=0-1
Jun 23 17:20:10 server1 kernel: [517262.504598] CPU: 4 PID: 1522 Comm: gmond Tainted: P OE 3.15.1-031501-lowlatency #201406161841
Jun 23 17:20:10 server1 kernel: [517262.504599] Hardware name: Dell Inc. PowerEdge R420/0K29HN, BIOS 2.3.3 07/10/2014
Jun 23 17:20:10 server1 kernel: [517262.504601] 0000000000000000 ffff880fce2ab848 ffffffff817746ec 0000000000000007
Jun 23 17:20:10 server1 kernel: [517262.504603] ffff880f74691950 ffff880fce2ab898 ffffffff8176a980 ffff880f00000000
Jun 23 17:20:10 server1 kernel: [517262.504605] 000201da81383df8 ffff881470376540 ffff881dcf7ab2a0 0000000000000000
Jun 23 17:20:10 server1 kernel: [517262.504607] Call Trace:
Jun 23 17:20:10 server1 kernel: [517262.504615] [<ffffffff817746ec>] dump_stack+0x4e/0x71
Jun 23 17:20:10 server1 kernel: [517262.504618] [<ffffffff8176a980>] dump_header+0x7e/0xbd
Jun 23 17:20:10 server1 kernel: [517262.504620] [<ffffffff8176aa16>] oom_kill_process.part.6+0x57/0x30a
Jun 23 17:20:10 server1 kernel: [517262.504623] [<ffffffff811654e7>] oom_kill_process+0x47/0x50
Jun 23 17:20:10 server1 kernel: [517262.504625] [<ffffffff81165825>] out_of_memory+0x145/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504628] [<ffffffff8116c1ba>] __alloc_pages_nodemask+0xb1a/0xc40
Jun 23 17:20:10 server1 kernel: [517262.504634] [<ffffffff811adba3>] alloc_pages_current+0xb3/0x180
Jun 23 17:20:10 server1 kernel: [517262.504636] [<ffffffff81161737>] __page_cache_alloc+0xb7/0xd0
Jun 23 17:20:10 server1 kernel: [517262.504638] [<ffffffff81163f80>] filemap_fault+0x280/0x430
Jun 23 17:20:10 server1 kernel: [517262.504642] [<ffffffff8118a0d9>] __do_fault+0x39/0x90
Jun 23 17:20:10 server1 kernel: [517262.504644] [<ffffffff8118e31e>] do_read_fault.isra.59+0x10e/0x1d0
Jun 23 17:20:10 server1 kernel: [517262.504646] [<ffffffff8118e870>] do_linear_fault.isra.61+0x70/0x80
Jun 23 17:20:10 server1 kernel: [517262.504647] [<ffffffff8118e986>] handle_pte_fault+0x76/0x1b0
Jun 23 17:20:10 server1 kernel: [517262.504652] [<ffffffff81095fe0>] ? lock_hrtimer_base.isra.25+0x30/0x60
Jun 23 17:20:10 server1 kernel: [517262.504654] [<ffffffff8118eea4>] __handle_mm_fault+0x1b4/0x360
Jun 23 17:20:10 server1 kernel: [517262.504655] [<ffffffff8118f101>] handle_mm_fault+0xb1/0x160
Jun 23 17:20:10 server1 kernel: [517262.504658] [<ffffffff81784667>] ? __do_page_fault+0x2b7/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504660] [<ffffffff81784522>] __do_page_fault+0x172/0x5a0
Jun 23 17:20:10 server1 kernel: [517262.504664] [<ffffffff8111fdec>] ? acct_account_cputime+0x1c/0x20
Jun 23 17:20:10 server1 kernel: [517262.504667] [<ffffffff810a73a9>] ? account_user_time+0x99/0xb0
Jun 23 17:20:10 server1 kernel: [517262.504669] [<ffffffff810a79dd>] ? vtime_account_user+0x5d/0x70
Jun 23 17:20:10 server1 kernel: [517262.504671] [<ffffffff8178498e>] do_page_fault+0x3e/0x80
Jun 23 17:20:10 server1 kernel: [517262.504673] [<ffffffff817811f8>] page_fault+0x28/0x30
Jun 23 17:20:10 server1 kernel: [517262.504674] Mem-Info:
Jun 23 17:20:10 server1 kernel: [517262.504675] Node 0 DMA per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504677] CPU 0: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504678] CPU 1: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504679] CPU 2: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504680] CPU 3: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504681] CPU 4: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504682] CPU 5: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504683] CPU 6: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504684] CPU 7: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504685] CPU 8: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504686] CPU 9: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 10: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504687] CPU 11: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504688] CPU 12: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504689] CPU 13: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504690] CPU 14: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504691] CPU 15: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504692] CPU 16: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504693] CPU 17: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504694] CPU 18: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504695] CPU 19: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504696] CPU 20: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504697] CPU 21: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 22: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504698] CPU 23: hi: 0, btch: 1 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504699] Node 0 DMA32 per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504701] CPU 0: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504702] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504703] CPU 2: hi: 186, btch: 31 usd: 34
Jun 23 17:20:10 server1 kernel: [517262.504704] CPU 3: hi: 186, btch: 31 usd: 27
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 4: hi: 186, btch: 31 usd: 30
Jun 23 17:20:10 server1 kernel: [517262.504705] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504706] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504707] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504708] CPU 8: hi: 186, btch: 31 usd: 173
Jun 23 17:20:10 server1 kernel: [517262.504709] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504710] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504711] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504712] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504713] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504714] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504715] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504716] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504717] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504718] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504719] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504720] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504721] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504722] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504723] Node 0 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504724] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504725] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504726] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504727] CPU 3: hi: 186, btch: 31 usd: 14
Jun 23 17:20:10 server1 kernel: [517262.504728] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504729] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504730] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504731] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504732] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504733] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504734] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504735] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504736] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504737] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504738] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504739] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504740] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504741] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504742] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504743] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504744] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504745] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504746] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504747] Node 1 Normal per-cpu:
Jun 23 17:20:10 server1 kernel: [517262.504748] CPU 0: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504749] CPU 1: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504750] CPU 2: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504751] CPU 3: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504752] CPU 4: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504753] CPU 5: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504754] CPU 6: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504755] CPU 7: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504756] CPU 8: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504757] CPU 9: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 10: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504758] CPU 11: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504759] CPU 12: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504760] CPU 13: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504761] CPU 14: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504762] CPU 15: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504763] CPU 16: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504764] CPU 17: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504765] CPU 18: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504766] CPU 19: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504767] CPU 20: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504768] CPU 21: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504769] CPU 22: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504770] CPU 23: hi: 186, btch: 31 usd: 0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_anon:17833290 inactive_anon:2465707 isolated_anon:0
Jun 23 17:20:10 server1 kernel: [517262.504773] active_file:573 inactive_file:595 isolated_file:36
Jun 23 17:20:10 server1 kernel: [517262.504773] unevictable:0 dirty:4 writeback:0 unstable:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free:82698 slab_reclaimable:43224 slab_unreclaimable:11476749
Jun 23 17:20:10 server1 kernel: [517262.504773] mapped:2465518 shmem:2465767 pagetables:66385 bounce:0
Jun 23 17:20:10 server1 kernel: [517262.504773] free_cma:0
Jun 23 17:20:10 server1 kernel: [517262.504776] Node 0 DMA free:14804kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15968kB managed:15828kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504779] lowmem_reserve[]: 0 2933 64370 64370
Jun 23 17:20:10 server1 kernel: [517262.504782] Node 0 DMA32 free:247776kB min:2048kB low:2560kB high:3072kB active_anon:1774744kB inactive_anon:607052kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3083200kB managed:3003592kB mlocked:0kB dirty:16kB writeback:0kB mapped:607068kB shmem:607068kB slab_reclaimable:25524kB slab_unreclaimable:302060kB kernel_stack:4928kB pagetables:3100kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2660 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504785] lowmem_reserve[]: 0 0 61436 61436
Jun 23 17:20:10 server1 kernel: [517262.504787] Node 0 Normal free:34728kB min:42952kB low:53688kB high:64428kB active_anon:30286072kB inactive_anon:9255576kB active_file:236kB inactive_file:640kB unevictable:0kB isolated(anon):0kB isolated(file):16kB present:63963136kB managed:62911420kB mlocked:0kB dirty:0kB writeback:0kB mapped:9255000kB shmem:9255724kB slab_reclaimable:86416kB slab_unreclaimable:22165372kB kernel_stack:21072kB pagetables:121112kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:13936 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504791] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504793] Node 1 Normal free:33484kB min:45096kB low:56368kB high:67644kB active_anon:39272344kB inactive_anon:200kB active_file:2112kB inactive_file:1752kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:67108864kB managed:66056916kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:276kB slab_reclaimable:60956kB slab_unreclaimable:23439564kB kernel_stack:13536kB pagetables:141328kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18448 all_unreclaimable? yes
Jun 23 17:20:10 server1 kernel: [517262.504797] lowmem_reserve[]: 0 0 0 0
Jun 23 17:20:10 server1 kernel: [517262.504799] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (R) 3*4096kB (M) = 14804kB
Jun 23 17:20:10 server1 kernel: [517262.504807] Node 0 DMA32: 4660*4kB (UEM) 2172*8kB (EM) 1739*16kB (EM) 1046*32kB (UEM) 629*64kB (EM) 344*128kB (UEM) 155*256kB (E) 46*512kB (UE) 3*1024kB (E) 0*2048kB 0*4096kB = 247904kB
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
Jun 23 17:20:10 server1 kernel: [517262.504829] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504830] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 23 17:20:10 server1 kernel: [517262.504831] 2467056 total pagecache pages
Jun 23 17:20:10 server1 kernel: [517262.504832] 0 pages in swap cache
Jun 23 17:20:10 server1 kernel: [517262.504833] Swap cache stats: add 0, delete 0, find 0/0
Jun 23 17:20:10 server1 kernel: [517262.504834] Free swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504834] Total swap = 0kB
Jun 23 17:20:10 server1 kernel: [517262.504835] 33542792 pages RAM
Jun 23 17:20:10 server1 kernel: [517262.504836] 0 pages HighMem/MovableOnly
Jun 23 17:20:10 server1 kernel: [517262.504837] 262987 pages reserved
Jun 23 17:20:10 server1 kernel: [517262.504838] 0 pages hwpoisoned
Jun 23 17:20:10 server1 kernel: [517262.504839] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 23 17:20:10 server1 kernel: [517262.504866] [ 569] 0 569 4997 144 13 0 0 upstart-udev-br
Jun 23 17:20:10 server1 kernel: [517262.504868] [ 578] 0 578 12891 187 29 0 -1000 systemd-udevd
Jun 23 17:20:10 server1 kernel: [517262.504873] [ 692] 101 692 80659 2295 59 0 0 rsyslogd
Jun 23 17:20:10 server1 kernel: [517262.504875] [ 750] 0 750 4084 331 13 0 0 upstart-file-br
Jun 23 17:20:10 server1 kernel: [517262.504877] [ 792] 0 792 3815 53 13 0 0 upstart-socket-
Jun 23 17:20:10 server1 kernel: [517262.504879] [ 842] 111 842 27001 275 53 0 0 dbus-daemon
Jun 23 17:20:10 server1 kernel: [517262.504880] [ 851] 0 851 8834 101 22 0 0 systemd-logind
Jun 23 17:20:10 server1 kernel: [517262.504886] [ 1232] 0 1232 2558 572 8 0 0 dhclient
Jun 23 17:20:10 server1 kernel: [517262.504888] [ 1342] 104 1342 24484 281 49 0 0 ntpd
Jun 23 17:20:10 server1 kernel: [517262.504890] [ 1440] 0 1440 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504891] [ 1443] 0 1443 3955 41 12 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504893] [ 1448] 0 1448 3955 39 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504895] [ 1450] 0 1450 3955 41 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504896] [ 1452] 0 1452 3955 42 13 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504898] [ 1469] 0 1469 4785 40 13 0 0 atd
Jun 23 17:20:10 server1 kernel: [517262.504900] [ 1470] 0 1470 15341 168 32 0 -1000 sshd
Jun 23 17:20:10 server1 kernel: [517262.504902] [ 1472] 0 1472 5914 65 17 0 0 cron
Jun 23 17:20:10 server1 kernel: [517262.504904] [ 1478] 999 1478 16020 3710 31 0 0 gmond
Jun 23 17:20:10 server1 kernel: [517262.504905] [ 1486] 0 1486 4821 65 14 0 0 irqbalance
Jun 23 17:20:10 server1 kernel: [517262.504907] [ 1500] 0 1500 343627 1730 85 0 0 nscd
Jun 23 17:20:10 server1 kernel: [517262.504909] [ 1559] 0 1559 1092 37 8 0 0 acpid
Jun 23 17:20:10 server1 kernel: [517262.504911] [ 1641] 0 1641 4978 71 13 0 0 master
Jun 23 17:20:10 server1 kernel: [517262.504913] [ 1650] 103 1650 5427 72 14 0 0 qmgr
Jun 23 17:20:10 server1 kernel: [517262.504917] [ 1895] 0 1895 1900 30 9 0 0 getty
Jun 23 17:20:10 server1 kernel: [517262.504919] [ 1906] 1000 1906 2854329 2610 2594 0 0 thttpd
Jun 23 17:20:10 server1 kernel: [517262.504927] [ 3163] 1000 3163 2432 39 10 0 0 searchd
Jun 23 17:20:10 server1 kernel: [517262.504928] [ 3167] 1000 3167 2727221 2467025 4863 0 0 sphinx-daemon
Jun 23 17:20:10 server1 kernel: [517262.504931] [47622] 1000 47622 17834794 17329575 33989 0 0 MyExec
<.................Trimmed bunch of processes with low mem usage.......................................>
Jun 23 17:20:10 server1 kernel: [517262.508350] Out of memory: Kill process 47622 (MyExec) score 526 or sacrifice child
Jun 23 17:20:10 server1 kernel: [517262.508375] Killed process 47622 (MyExec) total-vm:71339176kB, anon-rss:69318300kB, file-rss:0kB
Looking at the following lines, it seems like the issue is fragmentation:
Jun 23 17:20:10 server1 kernel: [517262.504816] Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB
Jun 23 17:20:10 server1 kernel: [517262.504822] Node 1 Normal: 9055*4kB (UM) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36220kB
I have no idea why the system would be so badly fragmented; it had only been running for 5 days when this happened. Also, looking at the process that invoked the OOM killer (gmond invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0), it was only requesting order-0 (4 kB) pages, and there are plenty of those available.
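For reference, each N*SIZEkB column in those zone lines is the free-block count at one buddy order (4 kB << order). A quick awk sketch, fed the Node 0 Normal line from the log above as sample input (on a live box you would take the per-order counts from dmesg or /proc/buddyinfo), totals them:

```shell
# Total the free memory reported in one buddy-allocator zone line.
# 'line' is pasted verbatim from the OOM report above.
line='Node 0 Normal: 9038*4kB (M) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36152kB'

total=$(echo "$line" | awk '{
    sum = 0
    for (i = 1; i <= NF; i++)
        if ($i ~ /\*/) {            # fields like 9038*4kB
            split($i, p, /\*|kB/)
            sum += p[1] * p[2]      # count * block size (kB)
        }
    print sum
}')
echo "free in zone: ${total} kB"    # prints: free in zone: 36152 kB
```

The kernel's own trailing "= 36152kB" confirms the arithmetic; the point is that all of that free memory sits in order-0 blocks, which is exactly what an order-0 request needs.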
One thing you may notice is that I have completely turned off swap and set swappiness to 0, on the reasoning that the system has more than enough RAM and should never hit swap. I am planning to enable swap and set swappiness to 10, but I am not sure whether that would help in this case.
Thanks for your input.
Upvotes: 1
Views: 1102
Reputation: 11
Updating with slabinfo. This is after the node was rebooted:
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kvm_async_pf 0 0 136 30 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16256 2 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 168 48 2 : tunables 0 0 0 : slabdata 0 0 0
fusion_ioctx 5005 5005 296 55 4 : tunables 0 0 0 : slabdata 91 91 0
fusion_user_ll_request 0 0 3960 8 8 : tunables 0 0 0 : slabdata 0 0 0
ext4_groupinfo_4k 131670 131670 136 30 1 : tunables 0 0 0 : slabdata 4389 4389 0
ip6_dst_cache 1260 1260 384 42 4 : tunables 0 0 0 : slabdata 30 30 0
UDPLITEv6 0 0 1088 30 8 : tunables 0 0 0 : slabdata 0 0 0
UDPv6 330 330 1088 30 8 : tunables 0 0 0 : slabdata 11 11 0
tw_sock_TCPv6 128 128 256 32 2 : tunables 0 0 0 : slabdata 4 4 0
TCPv6 288 288 1984 16 8 : tunables 0 0 0 : slabdata 18 18 0
kcopyd_job 0 0 3312 9 8 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2632 12 8 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 52 4 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
fuse_request 0 0 416 39 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_inode 0 0 768 42 8 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_key_record_cache 0 0 576 28 4 : tunables 0 0 0 : slabdata 0 0 0
ecryptfs_inode_cache 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
fat_inode_cache 0 0 712 46 8 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 54 54 600 54 8 : tunables 0 0 0 : slabdata 1 1 0
jbd2_journal_handle 2040 2040 48 85 1 : tunables 0 0 0 : slabdata 24 24 0
jbd2_journal_head 5071 5364 112 36 1 : tunables 0 0 0 : slabdata 149 149 0
jbd2_revoke_table_s 1792 1792 16 256 1 : tunables 0 0 0 : slabdata 7 7 0
jbd2_revoke_record_s 1536 1536 32 128 1 : tunables 0 0 0 : slabdata 12 12 0
ext4_inode_cache 75129 78771 984 33 8 : tunables 0 0 0 : slabdata 2387 2387 0
ext4_free_data 5952 6656 64 64 1 : tunables 0 0 0 : slabdata 104 104 0
ext4_allocation_context 768 768 128 32 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_io_end 1344 1344 72 56 1 : tunables 0 0 0 : slabdata 24 24 0
ext4_extent_status 37921 38352 40 102 1 : tunables 0 0 0 : slabdata 376 376 0
dquot 768 768 256 32 2 : tunables 0 0 0 : slabdata 24 24 0
dnotify_mark 782 782 120 34 1 : tunables 0 0 0 : slabdata 23 23 0
pid_namespace 0 0 2192 14 8 : tunables 0 0 0 : slabdata 0 0 0
posix_timers_cache 0 0 248 33 2 : tunables 0 0 0 : slabdata 0 0 0
UDP-Lite 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
xfrm_dst_cache 0 0 448 36 4 : tunables 0 0 0 : slabdata 0 0 0
ip_fib_trie 146 146 56 73 1 : tunables 0 0 0 : slabdata 2 2 0
UDP 828 828 896 36 8 : tunables 0 0 0 : slabdata 23 23 0
tw_sock_TCP 992 1152 256 32 2 : tunables 0 0 0 : slabdata 36 36 0
TCP 450 450 1792 18 8 : tunables 0 0 0 : slabdata 25 25 0
blkdev_queue 120 136 1896 17 8 : tunables 0 0 0 : slabdata 8 8 0
blkdev_requests 3358 3569 376 43 4 : tunables 0 0 0 : slabdata 83 83 0
blkdev_ioc 964 1287 104 39 1 : tunables 0 0 0 : slabdata 33 33 0
user_namespace 0 0 264 31 2 : tunables 0 0 0 : slabdata 0 0 0
sock_inode_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
net_namespace 0 0 4736 6 8 : tunables 0 0 0 : slabdata 0 0 0
shmem_inode_cache 2112 2112 672 48 8 : tunables 0 0 0 : slabdata 44 44 0
ftrace_event_file 1196 1196 88 46 1 : tunables 0 0 0 : slabdata 26 26 0
taskstats 196 196 328 49 4 : tunables 0 0 0 : slabdata 4 4 0
proc_inode_cache 63037 63250 648 50 8 : tunables 0 0 0 : slabdata 1265 1265 0
sigqueue 1224 1224 160 51 2 : tunables 0 0 0 : slabdata 24 24 0
bdev_cache 819 819 832 39 8 : tunables 0 0 0 : slabdata 21 21 0
kernfs_node_cache 54360 54360 112 36 1 : tunables 0 0 0 : slabdata 1510 1510 0
mnt_cache 510 510 320 51 4 : tunables 0 0 0 : slabdata 10 10 0
inode_cache 16813 19712 584 28 4 : tunables 0 0 0 : slabdata 704 704 0
dentry 144206 144606 192 42 2 : tunables 0 0 0 : slabdata 3443 3443 0
iint_cache 0 0 72 56 1 : tunables 0 0 0 : slabdata 0 0 0
buffer_head 6905641 6922305 104 39 1 : tunables 0 0 0 : slabdata 177495 177495 0
vm_area_struct 16764 16764 184 44 2 : tunables 0 0 0 : slabdata 381 381 0
mm_struct 1008 1008 896 36 8 : tunables 0 0 0 : slabdata 28 28 0
files_cache 1377 1377 640 51 8 : tunables 0 0 0 : slabdata 27 27 0
signal_cache 1380 1380 1088 30 8 : tunables 0 0 0 : slabdata 46 46 0
sighand_cache 1020 1020 2112 15 8 : tunables 0 0 0 : slabdata 68 68 0
task_xstate 1638 1638 832 39 8 : tunables 0 0 0 : slabdata 42 42 0
task_struct 837 855 6480 5 8 : tunables 0 0 0 : slabdata 171 171 0
Acpi-ParseExt 2968 2968 72 56 1 : tunables 0 0 0 : slabdata 53 53 0
Acpi-State 561 561 80 51 1 : tunables 0 0 0 : slabdata 11 11 0
Acpi-Namespace 3162 3162 40 102 1 : tunables 0 0 0 : slabdata 31 31 0
anon_vma 19313 19584 64 64 1 : tunables 0 0 0 : slabdata 306 306 0
shared_policy_node 7735 7735 48 85 1 : tunables 0 0 0 : slabdata 91 91 0
numa_policy 170 170 24 170 1 : tunables 0 0 0 : slabdata 1 1 0
radix_tree_node 2870899 2871624 584 28 4 : tunables 0 0 0 : slabdata 102558 102558 0
idr_layer_cache 555 555 2112 15 8 : tunables 0 0 0 : slabdata 37 37 0
dma-kmalloc-8192 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4096 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2048 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1024 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 42 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8192 180 180 8192 4 8 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-4096 636 720 4096 8 8 : tunables 0 0 0 : slabdata 90 90 0
kmalloc-2048 6498 6688 2048 16 8 : tunables 0 0 0 : slabdata 418 418 0
kmalloc-1024 4677 4800 1024 32 8 : tunables 0 0 0 : slabdata 150 150 0
kmalloc-512 9029 9056 512 32 4 : tunables 0 0 0 : slabdata 283 283 0
kmalloc-256 31542 31840 256 32 2 : tunables 0 0 0 : slabdata 995 995 0
kmalloc-192 16548 16548 192 42 2 : tunables 0 0 0 : slabdata 394 394 0
kmalloc-128 8449 8544 128 32 1 : tunables 0 0 0 : slabdata 267 267 0
kmalloc-96 20607 21462 96 42 1 : tunables 0 0 0 : slabdata 511 511 0
kmalloc-64 71408 75968 64 64 1 : tunables 0 0 0 : slabdata 1187 1187 0
kmalloc-32 5760 5760 32 128 1 : tunables 0 0 0 : slabdata 45 45 0
kmalloc-16 13824 13824 16 256 1 : tunables 0 0 0 : slabdata 54 54 0
kmalloc-8 45056 45056 8 512 1 : tunables 0 0 0 : slabdata 88 88 0
kmem_cache_node 551 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
kmem_cache 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0
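If it helps, the dump above can be ranked by approximate cache footprint (num_objs * objsize). A sketch of the awk pass, shown here against two lines copied from the dump (run it against /proc/slabinfo as root for the full picture):

```shell
# Rank slab caches by approximate footprint: num_objs * objsize bytes.
# Input format is /proc/slabinfo; data lines contain ": tunables" and
# the header line starts with '#', so filter on that.
rank_slabs() {
    awk '!/^#/ && /tunables/ {
        printf "%10.1f MB  %s\n", $3 * $4 / 1048576, $1
    }' "$@" | sort -rn
}

# Demo input: two lines copied from the slabinfo dump above.
rank_slabs <<'EOF'
buffer_head       6905641 6922305    104   39    1 : tunables 0 0 0 : slabdata 177495 177495 0
radix_tree_node   2870899 2871624    584   28    4 : tunables 0 0 0 : slabdata 102558 102558 0
EOF
# ->   1599.3 MB  radix_tree_node
# ->    686.6 MB  buffer_head
```

On a live box, `rank_slabs /proc/slabinfo | head` (as root) lists the top offenders; `slabtop` shows the same interactively.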
Upvotes: 0
Reputation: 6786
From the last few lines of the log you can see the kernel reports a total-vm usage of 71339176 kB (~68 GiB); note that total-vm counts virtual address space, which can be backed by both physical memory and swap. The log also shows a resident set (anon-rss) of about 66 GiB.
Is my understanding of fragmentation correct in this case?
If you are capturing system diagnostics (or an sosreport) around the time the issue occurs, check the /proc/buddyinfo
file for signs of memory fragmentation. It is best to write a script that backs up this information periodically if you are planning to reproduce the problem.
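A minimal sketch of such a collector, assuming an arbitrary log location under /tmp (adjust to taste) and a cron entry to run it every few minutes:

```shell
#!/bin/sh
# Append a timestamped copy of /proc/buddyinfo so that fragmentation
# can later be correlated with an OOM event. The log location is an
# arbitrary choice; run this from cron, e.g. every 5 minutes.
LOGDIR=${LOGDIR:-${TMPDIR:-/tmp}/buddyinfo}
mkdir -p "$LOGDIR"

{
    date '+%F %T'
    if [ -r /proc/buddyinfo ]; then
        cat /proc/buddyinfo
    fi
    echo "---"
} >> "$LOGDIR/buddyinfo.log"
```

Each record then shows the free-block counts per order for every zone, so you can watch the higher orders drain over time.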
How can I figure out why the memory got so fragmented? What can I do to avoid getting into this situation?

Sometimes applications commit more memory than the system is able to honour, which can eventually lead to OOM. You can read the currently set values of the kernel tunables with
sysctl -a
and try disabling memory overcommit by setting:
vm.overcommit_memory=2
vm.overcommit_ratio=80
Note: after adding the above lines to /etc/sysctl.conf
it is best to restart the system.
vm.overcommit_memory: some applications ask to allocate more virtual memory than is actually available on the system. The tunable takes the following values:
0 - a heuristic overcommit algorithm is used (the default; your server is most likely set to 0 or 1)
1 - always overcommit, regardless of whether memory is available
2 - never overcommit: the kernel refuses commitments beyond swap plus overcommit_ratio% of RAM, so vm.overcommit_ratio should also be set (e.g. to 80, capping total commitments at swap space + 80% of RAM)
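With vm.overcommit_memory=2 the kernel caps committed memory at swap + overcommit_ratio% of RAM (reported as CommitLimit in /proc/meminfo; hugepages ignored here). A quick sketch of that arithmetic for a box like this one (128 GB RAM, no swap, ratio 80):

```shell
# CommitLimit under vm.overcommit_memory=2:
#   CommitLimit = SwapTotal + RAM * overcommit_ratio / 100
ram_kb=134217728    # 128 GiB of RAM, as on the server in question
swap_kb=0           # swap is disabled
ratio=80            # proposed vm.overcommit_ratio

commit_limit_kb=$(( swap_kb + ram_kb * ratio / 100 ))
echo "CommitLimit: ${commit_limit_kb} kB"    # ~102 GiB
```

Compare this against the CommitLimit and Committed_AS lines in /proc/meminfo to see how close the box runs to the cap; with no swap, a single 71 GB process still fits, but the system as a whole cannot commit past ~102 GiB.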
Upvotes: 1
Reputation: 3935
Your understanding of fragmentation is incorrect. The OOM was issued because the memory watermarks were breached (free fell below min). Take a look at this:
Node 0 Normal free:34728kB min:42952kB low:53688kB
Node 1 Normal free:33484kB min:45096kB low:56368kB
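The same check can be run on a live box against /proc/zoneinfo: a zone whose free page count sits below its min watermark is exactly where the OOM killer gets invoked. A sketch (the demo feeds a trimmed sample whose page counts match the kB figures quoted above, i.e. kB / 4):

```shell
# Flag zones whose free page count is below the min watermark.
# Input format is /proc/zoneinfo.
check_watermarks() {
    awk '
        /^Node/ { gsub(/,/, "", $2); zone = $2 " " $4 }
        $1 == "pages" && $2 == "free" { free = $3 }
        $1 == "min" && free + 0 < $2 + 0 {
            printf "Node %s: free %d < min %d pages\n", zone, free, $2
        }
    ' "$@"
}

# Demo input: page counts derived from the log lines quoted above.
check_watermarks <<'EOF'
Node 0, zone   Normal
  pages free     8682
        min      10738
        low      13422
Node 1, zone   Normal
  pages free     8371
        min      11274
        low      14092
EOF
# -> Node 0 Normal: free 8682 < min 10738 pages
# -> Node 1 Normal: free 8371 < min 11274 pages
```

Run as `check_watermarks /proc/zoneinfo` to check a running system; both Normal zones in the log were below min, which is why the allocation path fell into out_of_memory().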
Upvotes: 1