Reputation: 769
I am building a chat-bot where every message a user sends needs to be converted to a vector (for other ML-related work). I am using a pre-trained Word2Vec model to do this. The Word2Vec model was created with the Gensim library, is saved to disk as a 600MB file, and is used in a Django/Python web application.
Every time a new message is received as an API request, a function loads the Word2Vec model and uses that object to generate a vector for the message. This needs to happen in real time. I am worried that because the application loads a fresh instance of the Word2Vec model for every message, it will run into memory problems if too many requests arrive at the same time (since there will be multiple instances of the Word2Vec model in RAM simultaneously). How do I handle the memory efficiently so that the application does not use too much of it?
Upvotes: 0
Views: 475
Reputation: 996
Does the "vector" need to be generated and returned to the user immediately? If not, you could consider passing the request to a Celery worker.
You can then choose how many workers you wish to assign to the processing queue, to restrict memory usage to a manageable level.
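To make the idea concrete, here is a rough sketch of the bounded-worker pattern using the standard library as a stand-in for Celery (with Celery itself, the equivalent knob is the worker's `--concurrency` option, e.g. `celery -A proj worker --concurrency=2`). The `vectorize` function here is a hypothetical placeholder for the real Word2Vec step:

```python
from concurrent.futures import ThreadPoolExecutor

def vectorize(message):
    # Placeholder for the real, memory-heavy Word2Vec lookup.
    return [float(len(w)) for w in message.split()]

# At most 2 jobs run at once, so memory usage stays bounded regardless
# of how many requests are queued -- the same effect as capping the
# number of Celery workers consuming the queue.
executor = ThreadPoolExecutor(max_workers=2)

def enqueue(message):
    # Returns immediately with a future; the API view can respond right
    # away and the vector can be stored or pushed to the user later.
    return executor.submit(vectorize, message)
```

The key point is that the queue absorbs bursts of traffic, while the fixed worker count decides how many model-sized chunks of memory are ever in use at the same time.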
Upvotes: 0