Identify Python Subprocess from terminal

Question

We have developed a Python function that starts a subprocess running pdftoppm/pdftocairo to split PDFs and store each page as an individual image. For example, a 10-page document produces 10 individual PNG files, one per page. Is there a way to identify these processes from the terminal using htop or ps -ef?

tripleee · Accepted Answer

If your Python program is still running when you want to reap the subprocesses, the simplest solution is probably to pass a timeout keyword parameter to Popen.wait() or Popen.communicate().

import subprocess

subprocs = []
for page in pdf.pages():
   sub = subprocess.Popen(['pdftoppm', 'etc', '--page', str(page), filename])
   subprocs.append(sub)
# some Python processing here while the subprocesses run in the background
# Then, once you are done and only want to reap them before you continue
for sub in subprocs:
   try:
      sub.wait(timeout=60)
   except subprocess.TimeoutExpired:
      sub.kill()  # wait() does not stop the child when the timeout expires
      sub.wait()  # reap the killed process

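When you wait on a subprocess which has already finished, the call returns immediately. Note that the timeout is counted from the moment wait() is called, not from when the subprocess was started, and that when it expires, wait() raises subprocess.TimeoutExpired without stopping the child. So the final for loop blocks on the first subprocess which hasn't yet finished, for at most the timeout, and then rapidly reaps the ones which finished in the meantime. A subprocess which is still running when its timeout expires has to be killed explicitly and then waited on again.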
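Because the timeout passed to wait() is measured from the call itself, a loop like the one above can in the worst case take up to the full timeout for every subprocess. Here is a minimal sketch of one way to bound the total wait instead, by sharing a single deadline across all children. The names reap_all and total_timeout are made up for illustration, and the child commands are stand-ins for the real pdftoppm calls.

```python
import subprocess
import sys
import time


def reap_all(procs, total_timeout):
    """Wait for every process, sharing a single overall deadline.

    Returns the return codes in the same order as procs.
    """
    deadline = time.monotonic() + total_timeout
    for proc in procs:
        # Time left until the shared deadline (never negative).
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()  # wait() only raises on timeout; it does not stop the child
            proc.wait()  # reap the killed child so it does not linger as a zombie
    return [proc.returncode for proc in procs]


if __name__ == "__main__":
    # Stand-in children: one exits at once, one would run for 30 seconds.
    quick = subprocess.Popen([sys.executable, "-c", "pass"])
    slow = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(reap_all([quick, slow], total_timeout=2))
```

Children that are still running at the deadline are killed, so whether this is acceptable depends on whether a half-converted page is worse than a late one.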

If your Python program has already finished executing and you have a bunch of subprocesses left running, the subprocesses you started will be orphans which get reparented, typically to become children of PID 1, so you can no longer inspect the parent process and see that they are yours. If they all run in a specific directory which no other processes are executing in, that could be a good way to isolate them. (In subprocess.Popen() you can pass in a directory with cwd=path_to_dir.) On Linux, the /proc filesystem lets you easily traverse the process tree and inspect individual processes. Each process has a directory /proc/<pid>, and the cwd entry in it is a symlink to the process's current working directory.

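from pathlib import Path

target = Path('/path/to/dir')
for proc in Path('/proc').iterdir():
  if proc.name.isdigit():
    try:
      # parentheses needed: the method call binds tighter than the / operator
      if (proc / 'cwd').readlink() == target:
        print(proc)
    except OSError:
      pass  # process already exited, or belongs to another user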

Unfortunately, Path.readlink() was only introduced in Python 3.9; if you need this on a machine with an older Python version, try the more traditional os.path spaghetti:

import os

for proc in os.listdir('/proc'):
  if proc.isdigit():
    try:
      if os.readlink(os.path.join('/proc', proc, 'cwd')) == '/path/to/dir':
        print(proc)
    except OSError:
      pass  # process already exited, or belongs to another user

Note that /proc is not portable, but since you specifically ask about Ubuntu, you should be able to use this approach.

If you don't want to run the subprocesses in a particular unique directory, there are probably other means to find your processes if they are reasonably unique, or to make them reasonably unique in order to facilitate this. Your question really doesn't reveal enough about your code or your requirements to know what exactly will work for you.
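One such means, as a rough sketch: on Linux, /proc/<pid>/cmdline holds the arguments of each process separated by NUL bytes, so you can look for an argument which only your processes would have, such as a job-specific input path. The function name find_by_cmdline and the proc_root parameter are invented for this example; proc_root only exists so the function can be tried against a directory other than /proc.

```python
from pathlib import Path


def find_by_cmdline(marker, proc_root="/proc"):
    """Return the sorted PIDs whose command line has marker as one argument."""
    matches = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdecimal():
            continue  # skip non-process entries such as /proc/self
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            # The process exited in the meantime, or we may not read it.
            continue
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        args = raw.decode(errors="replace").split("\0")
        if marker in args:
            matches.append(int(entry.name))
    return sorted(matches)
```

For example, find_by_cmdline('/path/to/input.pdf') would list the processes that were started with exactly that path as an argument, assuming nothing else on the machine is working on the same file.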

Perhaps you can just run the processes with an external timeout command and leave it at that. The GNU Coreutils timeout binary is part of the Ubuntu base install (but might not be available out of the box on some other U*x-like systems).

for page in pdf.pages():
    subprocess.Popen(['timeout', '60', 'pdftoppm', 'etc', '--page', str(page), filename])

(The above obviously guesses wildly about the actual command you are running and what parameters it takes.)
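If you also want to know afterwards whether the limit was hit, GNU timeout exits with status 124 when it had to stop the command (unless you pass --preserve-status). A small sketch of checking for that follows; run_with_limit is a hypothetical helper, and unlike the Popen loop above it uses subprocess.run, so it blocks until the command is done.

```python
import subprocess

TIMED_OUT = 124  # exit status GNU timeout uses when the time limit was hit


def run_with_limit(cmd, seconds):
    """Run cmd under GNU timeout; return (returncode, timed_out)."""
    result = subprocess.run(["timeout", str(seconds), *cmd])
    return result.returncode, result.returncode == TIMED_OUT
```

The check is ambiguous if the wrapped command can itself exit with status 124, so treat it as a hint rather than proof.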
