Matthias

Reputation: 10389

How to free TF/Keras memory in Python after a model has been deleted, while other models are still in memory and in use?

I have a Python server application which provides TensorFlow / Keras model inference services. Multiple such models can be loaded and used at the same time, for multiple different clients. A client can request to load another model, but this has no effect on the other clients (i.e. their models stay in memory and in use as they are), so each client can ask to load another model regardless of the state of any other client.

The logic and implementation work, however, I am not sure how to correctly free memory in this setup. When a client asks for a new model to load, the previously loaded model is simply deleted from memory (via the Python del statement), and the new model is then loaded via tensorflow.keras.models.load_model().
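
For illustration, the per-client switch currently looks roughly like this (a minimal sketch; the models dict and function name are mine, not the actual server code):

import tensorflow as tf

# One loaded model per client (illustrative structure)
models = {}

def switch_model(client_id, new_model_path):
    # Drop the client's previous model; this alone does not appear to release the GPU memory
    if client_id in models:
        del models[client_id]
    # Load the replacement model for this client
    models[client_id] = tf.keras.models.load_model(new_model_path)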

From what I read in the Keras documentation, one might want to clear a Keras session in order to free memory by calling tf.keras.backend.clear_session(). However, that seems to release all TF memory, which is a problem in my case, since other Keras models for other clients are still in use at the same time, as described above.
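
Continuing the sketch from above, the documented option would look like this, but it tears down the global Keras state behind every loaded model, not just the one being replaced (again, names are illustrative):

from tensorflow.keras import backend as K

def switch_model_with_clear(client_id, new_model_path):
    del models[client_id]
    # Frees the memory, but also invalidates the models still used by all other clients
    K.clear_session()
    models[client_id] = tf.keras.models.load_model(new_model_path)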

Moreover, it seems I cannot put every model into its own process, since I cannot access the single GPU from different running processes in parallel (or at all).

So in other words: When loading a new TensorFlow / Keras model while other models are also in memory and in use, how can I free the TF memory of the previously loaded model, without interfering with the other currently loaded models?

Upvotes: 3

Views: 10975

Answers (2)

Niteya Shah

Reputation: 1824

When a TensorFlow session starts, it tries to allocate all of the available GPU memory. This is what prevents multiple processes from running sessions at the same time. The ideal way to stop this is to ensure that the TF session only allocates part of the memory. From the docs, there are two ways to do this (depending on your TF version):

  1. The simple way is to enable memory growth (TF 2.2+):
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
  tf.config.experimental.set_memory_growth(gpu, True)

For TF 2.0/2.1:

import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)

For TF 1.x (allocate 30% of the GPU memory per process):

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
  2. The other method is more controlled, IMHO, and scales better. It requires you to create logical (virtual) devices and manually control placement for each of them:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Split the first physical GPU into two virtual devices with 1GB of memory each
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
             tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

Now you have to manually control placement using the with tf.device() context manager:

gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
  # Replicate your computation on multiple GPUs
  c = []
  for gpu in gpus:
    with tf.device(gpu.name):
      a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
      b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
      c.append(tf.matmul(a, b))

  with tf.device('/CPU:0'):
    matmul_sum = tf.add_n(c)

  print(matmul_sum)

Using this, you won't run out of memory and can run multiple processes at once.
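
For the model-serving case in the question, the same placement idea would look roughly like this (a sketch only; the 1GB split and the model paths are illustrative assumptions, not values from the question):

import tensorflow as tf

# Split the single physical GPU into two logical GPUs of 1GB each (sizes are illustrative)
phys = tf.config.experimental.list_physical_devices('GPU')
if phys:
    tf.config.experimental.set_virtual_device_configuration(
        phys[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])

logical = tf.config.experimental.list_logical_devices('GPU')

# Load each client's model onto its own logical device (paths are hypothetical)
with tf.device(logical[0].name):
    model_a = tf.keras.models.load_model('client_a_model.h5')
with tf.device(logical[1].name):
    model_b = tf.keras.models.load_model('client_b_model.h5')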

Upvotes: 5

Ismail Durmaz

Reputation: 2621

You can fork a new process (kernel) per customer. Each process executes its operations in an environment that is separated from the others. This is a safer and more isolated way.

I created a basic scenario which has two parts. The main part's responsibility is to start the client processes, send them work, and kill them. The client part's responsibility is to execute the operations ordered by the server. Each client waits for orders via HTTP requests.

main.py

import subprocess
import sys
import requests

class ClientOperator:
    def __init__(self, name, port, model):
        self.name = name
        self.port = port
        self.proc = subprocess.Popen([sys.executable, 'client.py', 
                                f'--port={port}', f'--model={model}'])
    
    def process(self, a, b):
        response = requests.get(f'http://localhost:{self.port}/process', 
                                params={'a': a, 'b': b}).json()

        print(f'{self.name} process {a} + {b} = {response}')

    def close(self):
        print(f'{self.name} is closing')
        self.proc.terminate()


customer1 = ClientOperator('John', 20001, 'model1.hdf5')
customer2 = ClientOperator('Oscar', 20002, 'model2.hdf5')

customer1.process(5, 10)
customer2.process(4, 6)

# stop customer1
customer1.close()

client.py

import argparse
from flask import Flask, request, jsonify

# parse arguments
parser = argparse.ArgumentParser()
parser.add_argument('--port', '-p', type=int)
parser.add_argument('--model', '-m', type=str)
args = parser.parse_args()

model = args.model

app = Flask(__name__)

@app.route('/process', methods=['GET'])
def process():
    result = int(request.args['a']) + int(request.args['b'])
    return jsonify({'result': result, 'model': model})


if __name__ == '__main__':
    app.run(host="localhost", port=args.port)

Output:

$ python main.py

 * Serving Flask app "client" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://localhost:20002/ (Press CTRL+C to quit)
 * Serving Flask app "client" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://localhost:20001/ (Press CTRL+C to quit)


127.0.0.1 - - [22/Jan/2021 16:31:26] "GET /process?a=5&b=10 HTTP/1.1" 200 -
John process 5 + 10 = {'model': 'model1.hdf5', 'result': 15}

127.0.0.1 - - [22/Jan/2021 16:31:27] "GET /process?a=4&b=6 HTTP/1.1" 200 -
Oscar process 4 + 6 = {'model': 'model2.hdf5', 'result': 10}

John is closing
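
The client stub above only adds two numbers. In the actual serving scenario, the client process would presumably load its Keras model once at startup and run inference in the /process handler, roughly like this (a sketch; the JSON input format and the memory-growth setup are my assumptions):

import argparse

import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

parser = argparse.ArgumentParser()
parser.add_argument('--port', '-p', type=int)
parser.add_argument('--model', '-m', type=str)
args = parser.parse_args()

# Enable memory growth so several client processes can share the single GPU
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Load the model once per process; terminating the process frees its GPU memory
model = tf.keras.models.load_model(args.model)

app = Flask(__name__)

@app.route('/process', methods=['POST'])
def process():
    # Expects a JSON body like {"inputs": [[...], ...]} (format is an assumption)
    inputs = np.array(request.get_json()['inputs'])
    outputs = model.predict(inputs)
    return jsonify({'outputs': outputs.tolist(), 'model': args.model})

if __name__ == '__main__':
    app.run(host='localhost', port=args.port)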

Upvotes: -1
