xYZ

Reputation: 107

When will /proc/<pid> be removed?

Process A opened and mmapped thousands of files while running. Then kill -9 <pid of process A> is issued. I have a question about the order of the following two events:
a) /proc/<pid of process A> cannot be accessed.
b) all files opened by process A are closed.

More background about the question:
Process A is a multi-threaded background service. It is started with the command ./process_A args1 arg2 arg3.
There is also a watchdog process that checks periodically (every second) whether process A is still alive and restarts it if it is dead. The watchdog checks process A as follows:
1) collect all numerical subdirectories under /proc/
2) compare each /proc/<pid>/cmdline with the command line of process A. If some /proc/<pid>/cmdline matches, process A is alive and nothing is done; otherwise process A is restarted.
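For reference, the two watchdog steps above could be sketched in C roughly as follows. This is only an illustration of the approach described in the question, not the actual watchdog code; the function name find_pid_by_cmdline and the prefix-matching rule are assumptions.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Returns 1 if `name` consists only of decimal digits (a PID directory). */
static int is_numeric(const char *name)
{
    if (*name == '\0')
        return 0;
    for (; *name; name++)
        if (!isdigit((unsigned char)*name))
            return 0;
    return 1;
}

/*
 * Hypothetical helper: scans /proc and returns the first PID whose cmdline
 * starts with the `want_len` bytes at `want`. Returns 0 if nothing matches
 * and -1 if /proc cannot be opened.
 * /proc/<pid>/cmdline separates arguments with NUL bytes, so `want` must be
 * given in the same form, e.g. "./process_A\0args1\0arg2\0arg3".
 */
pid_t find_pid_by_cmdline(const char *want, size_t want_len)
{
    DIR *proc = opendir("/proc");
    if (proc == NULL)
        return -1;

    pid_t found = 0;
    struct dirent *ent;
    while (found == 0 && (ent = readdir(proc)) != NULL) {
        if (!is_numeric(ent->d_name))
            continue;                       /* step 1: numerical subdirs only */

        char path[288];
        snprintf(path, sizeof path, "/proc/%s/cmdline", ent->d_name);

        FILE *f = fopen(path, "rb");
        if (f == NULL)
            continue;       /* process vanished between readdir and fopen */

        char buf[4096];
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);

        /* step 2: compare the command line */
        if (n >= want_len && memcmp(buf, want, want_len) == 0) {
            long pid = 0;
            sscanf(ent->d_name, "%ld", &pid);
            found = (pid_t)pid;
        }
    }
    closedir(proc);
    return found;
}
```

Note that the scan only says whether a /proc entry with that command line is still visible; it says nothing about whether the process has finished releasing its resources.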

Process A does the following during initialization:
1) open fileA
2) flock fileA
3) mmap fileA into memory
4) close fileA
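A minimal C sketch of these four steps, to make the sequence concrete. The question does not show the real code, so the function name lock_and_map, the exclusive lock mode, and the read-only shared mapping are all assumptions.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: opens `path`, takes an exclusive flock() on it, maps
 * the whole file read-only, and closes the descriptor. Returns the mapping
 * and stores its length in *len_out, or returns NULL on any failure.
 */
void *lock_and_map(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);                  /* 1) open fileA */
    if (fd < 0)
        return NULL;

    if (flock(fd, LOCK_EX) != 0) {                  /* 2) flock fileA; blocks
                                                          while another holder
                                                          has the lock */
        close(fd);
        return NULL;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {   /* mmap rejects length 0 */
        close(fd);
        return NULL;
    }

    /* 3) mmap fileA into memory (assumed: read-only, shared) */
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    close(fd);                                      /* 4) close fileA */
    if (addr == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return addr;
}
```

Step 2 is the call where the restarted process was observed to hang.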
After initialization, process A mmaps thousands of files. Several minutes later, kill -9 <pid of process A> is issued. The watchdog detects the death of process A and restarts it. But sometimes the new process A gets stuck at step 2, flock fileA. After some debugging, we found that fileA is unlocked when process A is killed, but sometimes this unlock happens only after the new process has already reached step 2, flock fileA.
So we suspect that checking whether the process is alive by monitoring /proc/<pid of process A> is not correct.

Upvotes: 1

Views: 2571

Answers (2)

John Zwinck

Reputation: 249592

Don't scan /proc/PID to find out if a specific process has terminated. There are lots of better ways to do that, such as having your watchdog program actually launch the server program and wait for it to terminate.

Or, have the watchdog listen on a TCP socket, and have the server process connect to that and send its PID. If either end dies, the other can notice that the connection was closed (hint: send a heartbeat packet every so often, so that a frozen peer can be detected too). If the watchdog receives a connection from another server while the first is still running, it can decide to allow it or tell one of the instances to shut down (via TCP or kill()).
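The detection side of that idea might look roughly like this. It is a sketch that works on any connected stream socket; setting up the TCP listener and the heartbeat message format are left out, and the names peer_state and wait_for_heartbeat are assumptions.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

enum peer_state { PEER_ALIVE, PEER_CLOSED, PEER_SILENT, PEER_ERROR };

/*
 * Hypothetical helper: waits up to `timeout_ms` milliseconds for a heartbeat
 * on the connected stream socket `fd` and classifies the result.
 */
enum peer_state wait_for_heartbeat(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };

    int r = poll(&p, 1, timeout_ms);
    if (r < 0)
        return PEER_ERROR;      /* includes EINTR; the caller may retry */
    if (r == 0)
        return PEER_SILENT;     /* no heartbeat in time: peer may be frozen */

    char buf[64];
    ssize_t n = recv(fd, buf, sizeof buf, 0);
    if (n > 0)
        return PEER_ALIVE;      /* heartbeat bytes arrived */
    if (n == 0)
        return PEER_CLOSED;     /* connection closed, e.g. the peer died */
    return PEER_ERROR;          /* e.g. ECONNRESET */
}
```

The timeout should be somewhat longer than the heartbeat interval, so that one delayed packet is not mistaken for a frozen peer.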

Upvotes: 2

then kill -9 is issued

This is a bad habit. You had better send SIGTERM first, because well-behaved processes and well-designed programs can catch it (and exit cleanly when they get a SIGTERM). In some cases, I even recommend: send SIGTERM, wait two or three seconds, send SIGQUIT, wait two seconds, and only then send SIGKILL (for those bad programs that have not been written properly or are misbehaving). Read signal(7) and signal-safety(7). In multi-threaded programs you might use the Linux-specific signalfd(2), or the self-pipe trick with pipe(7) (well explained in the Qt documentation, but not Qt-specific).
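That escalation could be sketched as below. It assumes the target is a child of the caller, so that waitpid() can be used to see whether it has gone; the names reap_within and stop_child, the single grace period for both steps, and the 100 ms polling interval are arbitrary choices for illustration.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/*
 * Polls for up to `seconds` for child `pid` to terminate.
 * Returns 1 and fills *status if it was reaped, 0 otherwise.
 */
static int reap_within(pid_t pid, int seconds, int *status)
{
    const struct timespec tick = { 0, 100 * 1000 * 1000 };   /* 100 ms */
    for (int i = 0; i < seconds * 10; i++) {
        if (waitpid(pid, status, WNOHANG) == pid)
            return 1;
        nanosleep(&tick, NULL);
    }
    return waitpid(pid, status, WNOHANG) == pid;
}

/*
 * Hypothetical helper: sends SIGTERM, then SIGQUIT, then SIGKILL to child
 * `pid`, waiting `grace` seconds after each of the first two signals.
 * Returns the wait status of the child, or -1 on error.
 */
int stop_child(pid_t pid, int grace)
{
    static const int sigs[] = { SIGTERM, SIGQUIT };
    int status;

    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        if (kill(pid, sigs[i]) != 0)
            return -1;
        if (reap_within(pid, grace, &status))
            return status;          /* exited after the polite request */
    }

    if (kill(pid, SIGKILL) != 0)    /* last resort: cannot be caught */
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

The returned status tells the caller which signal actually ended the process, via WIFSIGNALED and WTERMSIG.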

If your Linux system is systemd-based, you could start your program-A with systemd facilities and then use those facilities to "communicate" with it. In some ways (I don't know the details), systemd makes signals almost obsolete. Notice that signals are not multi-thread friendly; they were designed, in the previous century, for single-threaded processes.

we guess the way to check process alive by monitor /proc/ is not correct.

The usual (and faster, and "atomic" enough) way to detect the existence of a process (on which you have enough privileges, e.g. which runs with your uid/gid) is to use kill(2) with a signal number (the second argument to kill) of 0. To quote that manpage:

   If sig is 0, then no signal is sent, but existence and permission
   checks are still performed; this can be used to check for the
   existence of a process ID or process group ID that the caller is
   permitted to signal.

Of course, that other process can still terminate before any further interaction with it, because Linux has preemptive scheduling.

Your watchdog process had better use kill(pid-of-process-A, 0) to check the existence and liveness of that process-A. Using /proc/pid-of-process-A/ is not the correct way to do that.
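A small sketch of that check. The function name process_exists is made up, and treating EPERM as "exists" follows from the manpage text quoted above: the existence check passed and only the permission check failed.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if a process with this PID exists, 0 if it
 * does not, and -1 for an invalid PID or an unexpected error.
 */
int process_exists(pid_t pid)
{
    if (pid <= 0)
        return -1;          /* 0 and negative values address process groups */

    if (kill(pid, 0) == 0)
        return 1;           /* exists and we are permitted to signal it */
    if (errno == EPERM)
        return 1;           /* exists, but we lack permission to signal it */
    if (errno == ESRCH)
        return 0;           /* no such process */
    return -1;
}
```

Two caveats: a zombie that has not yet been reaped by its parent still counts as existing, and a PID can be reused by an unrelated process after the original one is gone, so the answer is only a snapshot.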

And whatever you code, that process-A could disappear asynchronously (in particular, if it has some bug that causes a segmentation fault). When a process terminates (even with a segmentation fault), the kernel releases its file locks.

Upvotes: 6
