MKaama

Reputation: 1939

How to find which process is leaking file handles in Linux?

The problem incident:

Our production system started denying services with the error message "Too many open files in system". Most services were affected, including the inability to start a new ssh session or even to log in on a virtual console from the physical terminal. Luckily, one root ssh session was still open, so we could interact with the system (moral: always keep one root session open!). As a side effect, some services (named, dbus-daemon, rsyslogd, avahi-daemon) saturated the CPU (100% load). The system also serves a large directory via NFS to a very busy client, which was backing up 50000 small files at the time. Restarting all kinds of services and programs normalized their CPU behavior, but did not solve the "Too many open files in system" problem.

The suspected cause

Most likely, some program is leaking file handles. The prime suspect is my tcl program, which was also saturating the CPU (not normal for it). However, killing it did not help, and, most disturbingly, lsof did not reveal any large number of open files.

Some evidence

We had to reboot, so whatever information was collected is all we have.

root@xeon:~# cat  /proc/sys/fs/file-max
205900
root@xeon:~# lsof
COMMAND     PID    USER   FD      TYPE     DEVICE   SIZE/OFF       NODE NAME
init          1    root  cwd       DIR        8,6       4096          2 /
init          1    root  rtd       DIR        8,6       4096          2 /
init          1    root  txt       REG        8,6     124704    7979050 /sbin/init
init          1    root  mem       REG        8,6      42580    5357606 /lib/i386-linux-gnu/libnss_files-2.13.so
init          1    root  mem       REG        8,6     243400    5357572 /lib/i386-linux-gnu/libdbus-1.so.3.5.4
...
A pretty normal list: definitely not 200K files, more like two hundred.
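
For comparison, the kernel's own count of allocated file handles can be read from /proc/sys/fs/file-nr (three fields: allocated, free, system max). If the allocated count is near file-max while lsof only sees a couple of hundred files, the handles are being held somewhere lsof cannot see. We did not capture this before rebooting, but it would look like:

cat /proc/sys/fs/file-nr
# prints: allocated-handles  free-handles  max
# an output like "205888 0 205900" (numbers illustrative) would confirm
# that the kernel really has allocated ~200K handles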

This is probably where the problem started:

less /var/log/syslog

Mar 27 06:54:01 xeon CRON[16084]: (CRON) error (grandchild #16090 failed with exit status 1)
Mar 27 06:54:21 xeon kernel: [8848865.426732] VFS: file-max limit 205900 reached
Mar 27 06:54:29 xeon postfix/master[1435]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system
Mar 27 06:54:29 xeon kernel: [8848873.611491] VFS: file-max limit 205900 reached
Mar 27 06:54:32 xeon kernel: [8848876.293525] VFS: file-max limit 205900 reached
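
If this happens again, one stopgap (not a fix) would be to raise the limit at runtime to buy a window for diagnosis; the value below is just an example:

sysctl -w fs.file-max=500000
# equivalently:
echo 500000 > /proc/sys/fs/file-max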

netstat did not show any noticeable anomalies either. The man pages for ps and top do not indicate any ability to show per-process open-file counts. The problem will probably repeat itself after a few months (that was our uptime).
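
One thing I could try next time, since ps and top cannot do it: count each process's open descriptors straight from /proc. A rough sketch (run as root so every process's fd directory is readable):

for p in /proc/[0-9]*; do
    printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "$p"
done | sort -rn | head -20
# lists the 20 processes holding the most file descriptors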

Any ideas on what else can be done to identify the open files?

UPDATE

This question has changed its meaning since qehgt identified the likely cause.

Apart from the bug in the NFS v4 code, I suspect there is a design limitation in Linux: kernel-leaked file handles can NOT be identified. Consequently, the original question transforms into: "Who is responsible for file handles in the Linux kernel?" and "Where do I post that question?". The first answer was helpful, but I am willing to accept a better answer.
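
For the record, a crude cross-check along these lines (my own sketch, not a verified diagnostic): compare the kernel's allocated-handle count with the sum of all per-process fd counts. A large gap would suggest the handles are held inside the kernel (e.g. by nfsd), where lsof cannot see them.

alloc=$(awk '{print $1}' /proc/sys/fs/file-nr)   # handles allocated by the kernel
used=0
for p in /proc/[0-9]*; do
    used=$((used + $(ls "$p/fd" 2>/dev/null | wc -l)))
done
echo "kernel-allocated: $alloc, held by processes: $used, unaccounted: $((alloc - used))"
# the numbers never match exactly (dup'ed descriptors, races while counting),
# but a gap of ~200K handles would clearly point into the kernel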

Upvotes: 2

Views: 8387

Answers (1)

qehgt

Reputation: 2990

Probably the root cause is a bug in NFSv4 implementation: https://stackoverflow.com/a/5205459/280758

They have similar symptoms.

Upvotes: 3
