Blindfreddy

Reputation: 702

How to instrument a python process which crashes after ~5 days without log entries

I am running a multi-process (and multi-threaded) python script on debian linux. One of the processes repeatedly crashes after 5 or 6 days. It is always the same, unique workload on the process that crashes. There are no entries in syslog about the crash - the process simply disappears silently. It also behaves completely normally and produces normal results, then suddenly stops.

How can I instrument the rogue process? Increasing the log level would produce large amounts of logs, so that's not my preferred option.

Upvotes: 0

Views: 150

Answers (1)

Blindfreddy

Reputation: 702

I used good old-fashioned log analysis to determine what happens when the process fails:

  1. increased the log level of the rogue process to INFO after 4 days
  2. monitored the application for the rogue process failing
  3. pinpointed the time of the failure in syslog
  4. analysed syslog at that time
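Step 1 above can be done without restarting the process. One way, sketched here under the assumption that the application installs a signal handler at startup (the logger name and signal choice are illustrative, not from the original setup), is to flip the log level on receipt of a Unix signal:

```python
# Sketch: raise a running process's log level on demand via SIGUSR1,
# so verbose logging only starts once the crash window approaches.
# The logger name "worker" is a hypothetical placeholder.
import logging
import signal

logger = logging.getLogger("worker")
logger.setLevel(logging.WARNING)  # quiet by default

def raise_log_level(signum, frame):
    # From the shell: kill -USR1 <pid> switches the process to INFO
    # without a restart and without days of verbose logs beforehand.
    logger.setLevel(logging.INFO)
    logger.warning("log level raised to INFO")

signal.signal(signal.SIGUSR1, raise_log_level)
```

This keeps the log volume low for the first days and only pays the logging cost in the window where the crash is expected.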

I found the following error at that time. The first line is the last entry made by the rogue process (just before it fails); the second line points to the underlying error. In this case there is a problem with the pyzmq bindings or the zeromq library, so I'll open a ticket with them.

Aug 10 08:30:13 rpi6 python[16293]: 2021-08-10T08:30:13.045 WARNING w1m::pid 16325, tid 16415, taking reading from sensors  with map {'000005ccbe8a': ['t-top'], '000005cc8eba': ['t-mid'], '00000676e5c3': ['t
Aug 10 08:30:14 rpi6 python[16293]: Too many open files (bundled/zeromq/src/ipc_listener.cpp:327)
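Since the underlying failure is "Too many open files", a leak like this can also be made visible long before the limit is hit by periodically logging the process's file-descriptor usage. A minimal sketch for Linux (the 80% threshold is an arbitrary assumption):

```python
# Sketch: report open file-descriptor usage for the current process on
# Linux, so a descriptor leak shows up in the logs well before the
# "Too many open files" limit is reached.
import os
import resource

def fd_usage():
    """Return (open_fds, soft_limit) for the current process."""
    open_fds = len(os.listdir("/proc/self/fd"))  # one entry per open fd
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit

open_fds, limit = fd_usage()
if open_fds > 0.8 * limit:  # warn at 80% of the soft limit (arbitrary)
    print(f"fd leak suspected: {open_fds}/{limit} descriptors open")
```

Calling this from an existing periodic task (or a monitoring thread) would have flagged the leak days before the crash.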

Hope this helps someone in the future.

Upvotes: 1
