Abraham Lopez

Reputation: 1

GPU Memory Issues Handling Multiple Simultaneous Requests with Flask-SocketIO, uWSGI, and Hugging Face Model

I am putting a chatbot into production using Flask-SocketIO integrated with uWSGI and gevent. The chatbot uses a Hugging Face model that occupies about 2GB of GPU memory.

The issue arises when two users send questions or messages simultaneously through the sockets. In this case, only one of the users receives a complete response, while the other receives an incomplete response. The server throws an error indicating that the GPU is out of memory.

When I check the GPU memory, it appears that the model is being loaded a second time, even though it should already be in memory, since it is loaded once when the server starts. Additionally, I cannot use more than 3GB of GPU memory, as another application occupies part of the GPU.
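To check whether a second copy really is being loaded (for example, by a second uWSGI worker process under lazy-apps), a minimal diagnostic sketch like the following could log the process ID and allocated memory right after the model loads; the model name is a placeholder, as in my code below:

import os
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('huggingface_model').to('cuda')

# If more than one PID prints here, each worker process holds its own
# ~2GB copy of the model on the GPU.
print(f"PID {os.getpid()}: "
      f"{torch.cuda.memory_allocated() / 1024**3:.2f} GB allocated")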

Technical Details:

Frameworks and tools: Flask-SocketIO, uWSGI, gevent, Hugging Face model

Hardware: GPU with 8GB of memory

uWSGI_config.ini

[uwsgi]
http = :8000
gevent = 1000
http-websockets = True
master = True
lazy-apps = True
wsgi-file = main.py
callable = app

Example app code in main.py:

from gevent import monkey
monkey.patch_all()  # patch_all() already patches sockets, no extra call needed
from flask import Flask, request
from flask_socketio import SocketIO
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

app = Flask(__name__)

socketio = SocketIO()
socketio.init_app(app, async_mode='gevent_uwsgi')
try:
    # Load the model once at startup; events.py imports these objects.
    model_name = 'huggingface_model'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda')
except Exception as e:
    print('Error: ' + str(e))
@app.route('/')
def index():
    return "Chatbot is active"

if __name__ == '__main__':
    socketio.run(app)

events.py

from flask import request

from main import socketio, model, tokenizer

@socketio.on('message')
def handle_message(data):
    user_input = data['message']
    inputs = tokenizer.encode(user_input, return_tensors='pt').to('cuda')
    outputs = model.generate(inputs)  # generate a reply on the GPU
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Send the reply word by word to the requesting client only
    for word in response.split(' '):
        socketio.emit('response', {'message': word}, room=request.sid)
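Each concurrent call to generate() allocates its own activation memory on top of the ~2GB of weights, which is what seems to push usage past my 3GB budget. Below is a minimal sketch (an assumption on my part, not production code) that serializes generation with a gevent semaphore so only one request touches the GPU at a time:

from flask import request
from gevent.lock import BoundedSemaphore
import torch

from main import socketio, model, tokenizer

# Allow only one generation at a time; queued requests wait instead of
# allocating GPU memory concurrently.
gpu_lock = BoundedSemaphore(1)

@socketio.on('message')
def handle_message(data):
    inputs = tokenizer.encode(data['message'], return_tensors='pt').to('cuda')
    with gpu_lock:
        with torch.no_grad():  # inference needs no autograd buffers
            outputs = model.generate(inputs)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    socketio.emit('response', {'message': response}, room=request.sid)

Since gevent greenlets are cooperative, the semaphore only queues requests within a single worker process; it would not help if uWSGI forked several workers, each with its own copy of the model.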

Question: How can I prevent the model from being unnecessarily reloaded and ensure the GPU can handle multiple simultaneous requests without running out of memory, considering that I cannot use more than 3GB of GPU memory due to another application occupying part of the GPU?
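One option I am aware of for the 3GB constraint is capping PyTorch's share of the card so the process fails fast instead of encroaching on the other application; a minimal sketch, assuming PyTorch 1.8+ where torch.cuda.set_per_process_memory_fraction is available:

import torch

# Limit this process to 3GB of the 8GB card (3/8 = 0.375). Allocations
# beyond the cap raise an out-of-memory error instead of taking memory
# that the other application on the GPU needs.
torch.cuda.set_per_process_memory_fraction(3 / 8, device=0)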

Any advice or solutions would be greatly appreciated. Thank you!

I have monitored GPU memory usage and confirmed that it exceeds the available limit when multiple simultaneous requests arrive.
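For completeness, this is how I can watch per-process usage while reproducing the issue (assuming nvidia-smi is on the PATH):

import subprocess

# Each row lists a PID and the GPU memory it holds, which shows whether
# one process spikes during generation or several processes each hold a
# copy of the model.
print(subprocess.check_output(
    ['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv'],
    text=True))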

Upvotes: 0

Views: 87

Answers (0)
