Reputation: 23
We have developed a Python function that uses a subprocess call to pdftoppm/pdftocairo to split PDFs and store each page as an individual image. For example, if a document is 10 pages, it creates 10 individual PNG files, each representing one page of the document. Is there a way to intercept these processes from the terminal using the htop or ps -ef commands?
Upvotes: 0
Views: 822
Reputation: 189377
If your Python program is still running when you want to reap the subprocesses, the simplest solution is probably to pass a timeout keyword parameter to Popen.wait() or Popen.communicate().
import subprocess

subprocs = []
for page in pdf.pages():
    sub = subprocess.Popen(['pdftoppm', 'etc', '--page', str(page), filename])
    subprocs.append(sub)

# Some Python processing here while the subprocesses run in the background...
# Then, once you are done and only want to reap them before you continue:
for sub in subprocs:
    sub.wait(timeout=60)
When you wait on a subprocess which has already finished, the call returns immediately. So the final for loop effectively waits for the slowest subprocess still running, and then rapidly reaps the rest. Note that wait(timeout=60) raises subprocess.TimeoutExpired if a subprocess is still running when its timeout expires, so you may want to catch that exception and kill the straggler.
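If some converter invocations might hang indefinitely, a minimal sketch of that timeout handling looks like this (the sleep commands here are placeholders standing in for the real pdftoppm invocations):

```python
import subprocess

# 'sleep 0' is a stand-in for the real pdftoppm commands;
# it exits immediately, so every wait() below returns at once.
subprocs = [subprocess.Popen(['sleep', '0']) for _ in range(3)]

for sub in subprocs:
    try:
        sub.wait(timeout=60)
    except subprocess.TimeoutExpired:
        # The subprocess overran its budget: kill it,
        # then wait() again to reap the resulting zombie.
        sub.kill()
        sub.wait()
```

After the loop, every Popen object has been reaped, so nothing is left behind as a zombie for ps or htop to show.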
If your Python program has already finished executing and you have a bunch of subprocesses left running, the subprocesses you started will be orphans which get reparented to be children of PID 1, so you can no longer inspect the parent process and see that they are yours. If they all run in a specific directory which no other processes are executing in, that could be a good way to isolate them. (In subprocess.Popen() you can pass in a directory with cwd=path_to_dir.) On Linux, the /proc filesystem lets you easily traverse the process tree and inspect individual processes. The cwd entry for each process is a symlink to the directory where that process is running.
from pathlib import Path

for proc in Path('/proc').iterdir():
    if proc.name.isdigit():  # numeric entries under /proc are PIDs
        try:
            if (proc / 'cwd').readlink() == Path('/path/to/dir'):
                print(proc)
        except OSError:
            pass  # process exited, or belongs to another user
Unfortunately, Path.readlink() was only introduced in Python 3.9; if you need this on a machine with an older Python version, try the more traditional os.path spaghetti:
import os

for proc in os.listdir('/proc'):
    if proc.isdigit():  # numeric entries under /proc are PIDs
        try:
            if os.readlink(os.path.join('/proc', proc, 'cwd')) == '/path/to/dir':
                print(proc)
        except OSError:
            pass  # process exited, or belongs to another user
Note that /proc is not portable, but since you specifically ask about Ubuntu, you should be able to use this approach.
If you don't want to run the subprocesses in a particular unique directory, there are probably other means to find your processes if they are reasonably unique, or to make them reasonably unique in order to facilitate this. Your question really doesn't reveal enough about your code or your requirements to know what exactly will work for you.
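For instance, if the converter binary's name is distinctive enough on its own, you could match on the comm entry in /proc instead of the working directory. A sketch (find_by_name is a hypothetical helper name, not a standard API):

```python
import os

def find_by_name(name):
    """Return the PIDs of all processes whose command name equals `name`."""
    pids = []
    for entry in os.listdir('/proc'):
        if not entry.isdigit():  # numeric entries under /proc are PIDs
            continue
        try:
            # /proc/<pid>/comm holds the process's command name
            with open(os.path.join('/proc', entry, 'comm')) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return pids
```

Calling find_by_name('pdftoppm') would then list any converter processes still alive, whether or not their original parent is.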
Perhaps you can just run the processes with an external timeout command and leave it at that. The GNU Coreutils timeout binary is part of the Ubuntu base install (but might not be available out of the box on some other U*x-like systems).
for page in pdf.pages():
    subprocess.Popen(['timeout', '60', 'pdftoppm', 'etc', '--page', str(page), filename])
(The above obviously guesses wildly about the actual command you are running and what parameters it takes.)
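As a quick sanity check of how timeout behaves (using sleep as a stand-in for the real command): when the limit is hit, timeout kills the command and exits with status 124.

```shell
# timeout kills the command once the limit expires and exits with 124
timeout 1 sleep 5
echo "exit status: $?"
```

So a converter that hangs would simply be terminated after 60 seconds with no extra Python bookkeeping required.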
Upvotes: 1
Reputation: 3361
If you actually ran the process via the subprocess module, then yes, it will show up as a regular child process.
>>> from subprocess import run
>>> run('/usr/bin/cat')
Will result in:
$ ps -u myuser
...
36456 pts/2 00:00:00 python3
36463 pts/2 00:00:00 cat
...
Upvotes: 0