Reputation: 1939
Our production system started denying services with an error message "Too many open files in system". Most of the services were affected, including inability to start a new ssh session, or even log in into virtual console from the physical terminal. Luckily, one root ssh session was open, so we could interact with the system (morale: keep one root session always open!). As a side effect, some services (named
, dbus-daemon
, rsyslogd
, avahi-daemon
) saturated the CPU (100% load). The system also serves a large directory via NFS
to a very busy client which was backing up 50000 small files at the moment. Restarting all kinds of services and programs normalized their CPU behavior, but did not solve the "Too many open files in system" problem.
Most likely, some program is leaking file handles. Probably the culprit is my tcl program, which also saturated the CPU (not normal). However, killing it did not help, but, most disturbingly, lsof
would not reveal large amounts of open files.
We had to reboot, so whatever information was collected is all we have.
root@xeon:~# cat /proc/sys/fs/file-max
205900
root@xeon:~# lsof
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME init 1 root cwd DIR 8,6 4096 2 / init 1 root rtd DIR 8,6 4096 2 / init 1 root txt REG 8,6 124704 7979050 /sbin/init init 1 root mem REG 8,6 42580 5357606 /lib/i386-linux-gnu/libnss_files-2.13.so init 1 root mem REG 8,6 243400 5357572 /lib/i386-linux-gnu/libdbus-1.so.3.5.4 ... A pretty normal list, definitely not 200K files, more like two hundred.
This is probably, where the problem started:
less /var/log/syslog
Mar 27 06:54:01 xeon CRON[16084]: (CRON) error (grandchild #16090 failed with exit status 1)
Mar 27 06:54:21 xeon kernel: [8848865.426732] VFS: file-max limit 205900 reached
Mar 27 06:54:29 xeon postfix/master[1435]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system
Mar 27 06:54:29 xeon kernel: [8848873.611491] VFS: file-max limit 205900 reached
Mar 27 06:54:32 xeon kernel: [8848876.293525] VFS: file-max limit 205900 reached
netstat
did not show noticeable anomalies either.
The man pages for ps
and top
do not indicate an ability to show open file count. Probably the problem will repeat itself after a few months (that was our uptime).
Any ideas on what else can be done to identify the open files?
This question has changed the meaning, after qehgt identified the likely cause.
Apart from the bug in NFS v4 code, I suspect there is a design limitation in Linux and kernel-leaked file handles can NOT be identified. Consequently, the original question transforms into: "Who is responsible for file handles in the Linux kernel?" and "Where do I post that question?". The 1st answer was helpful, but I am willing to accept a better answer.
Upvotes: 2
Views: 8387
Reputation: 2990
Probably the root cause is a bug in NFSv4 implementation: https://stackoverflow.com/a/5205459/280758
They have similar symptoms.
Upvotes: 3