Justin Haynes

Reputation: 83

How in Python can I handle a SIGTERM only after the program has exited a critical section?

A Python 2.7 program called 'eventcollector' runs continuously and polls a webservice for events. It appends each event as a JSON object to the end of a file, /var/log/eventsexample.json. An agent tails that file and sends the events up to cloud-based software called 'anycloud', which processes them.

I need to make eventcollector a well-behaved UNIX daemon and then make that daemon a service in systemd. The systemd .service unit I will create for this purpose will tell systemd that when stopping this service it must wait 15 seconds after sending SIGTERM before sending SIGKILL. This gives eventcollector time to save state and close the files it is writing (its own log file and the event file). I must now make this program more resilient: it must be able to save its state so that when it is terminated and restarted, it knows where it left off.

Eventcollector has no visibility into anycloud; it can only see events in the source service. If eventcollector dies because of a restart, it must reliably know what its new start_time is when querying the source service for events. Therefore finishing the critical business of writing events to the file, and saving state, before exiting is essential.

My question is specifically about how to handle the SIGTERM such that the program has time to finish what it is doing and then save its state.

My concern, however, is that unless I write state after every message I write to the file (which would consume more resources than seems necessary), I cannot be sure my program won't be terminated before saving state in time. The impact of this would be duplicate messages, and duplicate messages are not acceptable.

If I must take the performance hit, I will, but I would prefer a way to handle a SIGTERM gracefully such that the program can smartly do the following, for example (simplified pseudocode excerpt):

while True:
    response = query the webservice using a method returning
               a list of 100 dictionaries (events)
    for i in response.data:
        event = json.dumps(i)
        outputfile.write(event)  # <- Receive SIGTERM during the 2nd event, but do
                                 #    not exit until the for loop is done. (how?)

signal handler:
    pickle an object with the current state.

The idea is that even if the SIGTERM were received while the 2nd event is being written, the program would wait until it had written the 100th event before deciding it is safe to handle the SIGTERM.
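One common pattern that matches this exactly: let the SIGTERM handler do nothing except set a flag, and check that flag only at safe points, i.e. after the for loop has written the whole batch. A minimal, self-contained sketch (fetch_batch is a hypothetical stand-in for the webservice query, and the mid-batch os.kill simulates systemd sending SIGTERM):

```python
import json
import os
import signal
import tempfile

shutdown_requested = False

def handle_sigterm(signum, frame):
    # Do no real work in the handler: just record that shutdown was requested.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

def fetch_batch(n):
    # Hypothetical stand-in for the webservice query: one batch of 100 events.
    return [{'batch': n, 'seq': i} for i in range(100)]

outputfile = tempfile.NamedTemporaryFile(mode='w', suffix='.json', delete=False)
events_written = 0
batch = 0
while not shutdown_requested:            # flag checked only here, between batches
    for i in fetch_batch(batch):
        outputfile.write(json.dumps(i) + '\n')
        events_written += 1
        if batch == 1 and i['seq'] == 1:
            # Simulate systemd sending SIGTERM during the 2nd event of a batch.
            os.kill(os.getpid(), signal.SIGTERM)
    batch += 1
outputfile.close()
os.unlink(outputfile.name)
# Save state here (e.g. pickle the last-seen timestamp) before exiting.

print(events_written)   # 200: the batch in flight was finished before stopping
```

Because CPython only runs Python-level signal handlers between bytecode instructions of the main thread, the handler can never interrupt a single `write()` halfway, and since the flag is tested only at the top of the `while` loop, the 100-event batch always completes before shutdown begins.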

I read in https://docs.python.org/2/library/signal.html:

There is no way to “block” signals temporarily from critical sections (since this is not supported by all Unix flavors).

One idea I had seemed too complex, and it seemed to me that there must be an easier way. The idea was:

  1. A main thread has a signal handler responsible for handling SIGTERM.
  2. The main thread communicates with a worker thread through a novel protocol so that the worker thread can tell the main thread when it is entering or leaving a critical section.
  3. When the main thread receives the SIGTERM, it waits until the worker thread reports that it is out of its critical section. The main thread then tells it to save state and shut down.
  4. When the worker thread finishes, it tells the main thread it is done. The main thread then exits cleanly with a zero status.
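The scheme above may be simpler than it sounds, because in CPython signals are only ever delivered to the main thread, and the "novel protocol" can be a plain threading.Event. A rough sketch (names are illustrative; in a real daemon the SIGTERM handler, not the test line, would call stop_requested.set()):

```python
import threading

stop_requested = threading.Event()   # set by the main thread's SIGTERM handler

def worker(batches, written):
    for batch in batches:
        for event in batch:          # critical section: write the whole batch
            written.append(event)
        if stop_requested.is_set():  # safe point: only checked between batches
            break

written = []
batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
stop_requested.set()                 # simulate SIGTERM having already arrived
t = threading.Thread(target=worker, args=(batches, written))
t.start()
t.join()                             # main thread waits; after join() the worker
                                     # is past its critical section, so it is
                                     # safe to save state and exit with status 0
print(written)   # [1, 2, 3] -- the in-progress batch finished, the rest skipped
```

The worker never has to tell the main thread anything mid-batch; `join()` itself is the "I am done" message, which removes most of the protocol complexity.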

Supplemental

I'm considering using python-daemon, which I understand to be Ben Finney's reference implementation of the PEP he wrote, [PEP 3143](https://www.python.org/dev/peps/pep-3143/). Based on what he has written, and on my own experience with UNIX and UNIX-like OSes, what constitutes "good behavior" on the part of a daemon is not universally agreed upon. I mention this because I do agree with PEP 3143 and would like to implement it; however, it does not answer my current question about how to deal with signals the way I would like to.

Upvotes: 0

Views: 852

Answers (1)

James Li

Reputation: 504

Your daemon is in Python 2.7, and Python is not so convenient for making syscalls, which makes /dev/shm and semaphores awkward. I am also not sure about the side effects and caveats of using global variables in Python. File locks are fragile, and filesystem I/O is a bad idea inside signal handlers. So I do not have a perfect answer, only ideas.

Here is the idea I used when implementing a small daemon in C:

  1. The main thread sets up a synchronization point. For a C program, /dev/shm, a semaphore, a global variable, and a file lock are the options I considered; I chose /dev/shm in the end.
  2. Set up the signal handler; on receiving SIGTERM, raise the synchronization flag by changing the value stored in /dev/shm.
  3. In every worker thread, check /dev/shm for the synchronization flag after each portion of work, and have the thread exit if the flag is raised.
  4. In the main thread, set up a harvesting thread that tries to harvest every other worker thread; once it succeeds, go on to exit the daemon itself.
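For a single Python process, those four steps collapse nicely: a threading.Event can play the role of the /dev/shm flag between threads, and the harvesting step is just a join() over the workers. A minimal Python sketch of the same scheme (in a real daemon the SIGTERM handler, not the demo line, would call shutdown.set()):

```python
import threading

shutdown = threading.Event()             # step 1: the synchronization point

def worker(name, finished):
    while not shutdown.is_set():         # step 3: check the flag after each
        pass                             #         portion of work (placeholder)
    finished.append(name)                # exit at a safe point

finished = []
workers = [threading.Thread(target=worker, args=('w%d' % i, finished))
           for i in range(3)]
for t in workers:
    t.start()
shutdown.set()                           # step 2: stands in for the SIGTERM
                                         #         handler raising the flag
for t in workers:                        # step 4: "harvest" every worker thread,
    t.join()                             #         then the daemon itself can exit
print(sorted(finished))                  # ['w0', 'w1', 'w2']
```

An Event avoids the side-effect worries of a bare global variable: set() and is_set() are safe to call from a Python signal handler and from any thread.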

Upvotes: 1
