Reputation: 10389
I have a Python server application that provides TensorFlow / Keras model inference services. Multiple different models can be loaded and used at the same time, for multiple different clients. A client can request to load another model, but this has no effect on the other clients (i.e. their models stay in memory and in use, so each client can ask to load another model regardless of the state of any other client).
The logic and implementation work; however, I am not sure how to correctly free memory in this setup. When a client asks for a new model to be loaded, the previously loaded model is simply deleted from memory (via the Python del statement), and then the new model is loaded via tensorflow.keras.models.load_model().
From what I read in the Keras documentation, one might want to clear a Keras session in order to free memory by calling tf.keras.backend.clear_session(). However, that seems to release all TF memory, which is a problem in my case, since other Keras models for other clients are still in use at the same time, as described above.
Moreover, it seems I cannot put each model into its own process, since I cannot access the single GPU from different running processes in parallel (or at all).
So in other words: when loading a new TensorFlow / Keras model while other models are also in memory and in use, how can I free the TF memory of the previously loaded model without interfering with the other currently loaded models?
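To illustrate, here is a minimal sketch of what the per-client swap currently looks like (the ModelRegistry class and the names client_id / model_path are simplified for this question):
import gc
import tensorflow as tf

class ModelRegistry:
    """Keeps one loaded Keras model per client."""

    def __init__(self):
        self._models = {}  # client_id -> tf.keras.Model

    def load(self, client_id, model_path):
        # Drop this client's previous model before loading the new one.
        old = self._models.pop(client_id, None)
        if old is not None:
            del old       # remove the last Python reference
            gc.collect()  # but this alone does not seem to release the GPU memory
        self._models[client_id] = tf.keras.models.load_model(model_path)

    def predict(self, client_id, inputs):
        return self._models[client_id].predict(inputs)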
Upvotes: 3
Views: 10975
Reputation: 1824
When a TensorFlow session starts, it tries to allocate all of the available GPU memory. This is what prevents multiple processes from running sessions. The ideal way to stop this is to ensure that the TF session only allocates a part of the memory. From the docs, there are a couple of ways to do this, depending on your TF version:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
for tf 2.0/2.1
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
for tf 1.* (allocate 30% of the GPU memory per process)
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
Alternatively, you can put a hard cap on memory usage by splitting the physical GPU into virtual devices with fixed memory limits:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Create two virtual GPUs with 1GB of memory each on the first physical GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Now you have to manually control placement using the with tf.device() context manager:
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
    # Replicate your computation on the multiple (logical) GPUs
    c = []
    for gpu in gpus:
        with tf.device(gpu.name):
            a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
            b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
            c.append(tf.matmul(a, b))
    with tf.device('/CPU:0'):
        matmul_sum = tf.add_n(c)
    print(matmul_sum)
Using this, you won't run out of memory and can run multiple processes at once.
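To connect this back to the question: once memory growth is enabled, each model can live in its own process, and the GPU memory is returned to the system when that process exits. A minimal sketch of such a worker (run_worker, model_path and inputs are illustrative names, not a fixed API):
import tensorflow as tf

def run_worker(model_path, inputs):
    # Enable memory growth before any GPU op so this process only
    # grabs the GPU memory it actually needs.
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)

    model = tf.keras.models.load_model(model_path)
    return model.predict(inputs)
Killing such a worker frees its model's memory without touching the models held by other processes.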
Upvotes: 5
Reputation: 2621
You can fork a new process (kernel) per customer. Each process executes its operations in an environment that is separated from the others, which is a safer and more isolated approach.
I created a basic scenario with two parts. The main part is responsible for starting, driving and killing the processes. The client part is responsible for executing the operations the server orders. Each client waits for orders via HTTP requests.
main.py
import subprocess
import sys

import requests


class ClientOperator:
    def __init__(self, name, port, model):
        self.name = name
        self.port = port
        self.proc = subprocess.Popen([sys.executable, 'client.py',
                                      f'--port={port}', f'--model={model}'])

    def process(self, a, b):
        response = requests.get(f'http://localhost:{self.port}/process',
                                params={'a': a, 'b': b}).json()
        print(f'{self.name} process {a} + {b} = {response}')

    def close(self):
        print(f'{self.name} is closing')
        self.proc.terminate()


customer1 = ClientOperator('John', 20001, 'model1.hdf5')
customer2 = ClientOperator('Oscar', 20002, 'model2.hdf5')

customer1.process(5, 10)
customer2.process(4, 6)

# stop customer1
customer1.close()
client.py
import argparse

from flask import Flask, request, jsonify

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--port', '-p', type=int)
parser.add_argument('--model', '-m', type=str)
args = parser.parse_args()

model = args.model

app = Flask(__name__)


@app.route('/process', methods=['GET'])
def process():
    result = int(request.args['a']) + int(request.args['b'])
    return jsonify({'result': result, 'model': model})


if __name__ == '__main__':
    app.run(host="localhost", port=args.port)
Output:
$ python main.py
* Serving Flask app "client" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://localhost:20002/ (Press CTRL+C to quit)
* Serving Flask app "client" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://localhost:20001/ (Press CTRL+C to quit)
127.0.0.1 - - [22/Jan/2021 16:31:26] "GET /process?a=5&b=10 HTTP/1.1" 200 -
John process 5 + 10 = {'model': 'model1.hdf5', 'result': 15}
127.0.0.1 - - [22/Jan/2021 16:31:27] "GET /process?a=4&b=6 HTTP/1.1" 200 -
Oscar process 4 + 6 = {'model': 'model2.hdf5', 'result': 10}
John is closing
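The /process endpoint above only adds two numbers to keep the example small. Here is a sketch of how client.py could instead load the Keras model named by --model and serve predictions (the /predict route and the JSON input format are assumptions for illustration, not part of the example above):
import argparse

import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

parser = argparse.ArgumentParser()
parser.add_argument('--port', '-p', type=int)
parser.add_argument('--model', '-m', type=str)
args = parser.parse_args()

# Load the Keras model once, when this client process starts.
model = tf.keras.models.load_model(args.model)

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    # Assumes the request body is a JSON list of input rows for the model.
    inputs = np.array(request.get_json())
    outputs = model.predict(inputs)
    return jsonify({'outputs': outputs.tolist()})


if __name__ == '__main__':
    app.run(host="localhost", port=args.port)
Because each customer has its own process, terminating it (customer1.close() above) frees that model's GPU memory without affecting the other customers. If the processes share a single GPU, you would also enable memory growth in each of them, as shown in the other answer.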
Upvotes: -1