Reputation: 7054
We have a large pandas DataFrame (several GB) that is kept in memory. The application is a machine-learning web service that answers real-time queries.
Each query can contain up to 20,000 integers designating the rows of the DataFrame that the query relates to. (Different users have access to different items because of their geographic location.) The service then selects those rows and does some fancy machine learning on them to produce an answer for the user.
It works fairly well, but now we want to scale it further. Currently it can process only one request at a time (asynchronously). We could just run several copies, but that would mean duplicating the data, and we may not have that much memory. What would be a good solution?
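For reference, here is a stripped-down sketch of what each request does today (df, run_model and handle_query are placeholder names, not our real code):

    import numpy as np
    import pandas as pd

    # Toy stand-ins for the real data and model.
    df = pd.DataFrame({"feature": np.random.rand(1_000_000)})  # several GB in reality

    def run_model(subset):
        # Placeholder for the actual machine-learning step.
        return float(subset["feature"].mean())

    def handle_query(row_ids):
        # Select the (up to 20,000) rows this user may access, then run the model.
        subset = df.iloc[row_ids]
        return run_model(subset)

    print(handle_query([0, 5, 42]))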
Upvotes: 1
Views: 1028
Reputation: 526
Are you using the DataFrame itself in the ML code, or are you creating NumPy arrays from the DataFrame which are then passed to the ML code?
If the latter, and assuming the NumPy arrays can be created quickly compared with the time the ML code takes to run, I'd think the solution would be to have a single process read the DataFrame and create the NumPy arrays, then make those arrays available to the separate processes running the ML code via shared memory (see for instance the SharedArray package, https://pypi.python.org/pypi/SharedArray). You could then use multiprocessing.Queue or multiprocessing.Pipe to synchronize the DF and ML processes.
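A minimal sketch of that layout, assuming the ML step only needs a read-only NumPy feature matrix. The array contents and the per-query mean() are stand-ins for your real data and model; SharedArray's create/attach/delete and multiprocessing.Queue are used as in their documentation.

    import multiprocessing as mp

    import numpy as np
    import pandas as pd
    import SharedArray as sa

    SHM_NAME = "shm://features"  # shared-memory segment holding the feature matrix

    def worker(tasks, results):
        features = sa.attach(SHM_NAME)         # maps the same memory, no copy
        while True:
            row_ids = tasks.get()
            if row_ids is None:                # sentinel: shut down
                break
            subset = features[row_ids]         # only the rows this query may use
            results.put(float(subset.mean()))  # placeholder for the ML step

    if __name__ == "__main__":
        # Single "DF process": build the DataFrame once and publish its data
        # as a shared NumPy array that the workers can attach to by name.
        df = pd.DataFrame({"feature": np.random.rand(1_000_000)})
        features = sa.create(SHM_NAME, df.shape, dtype=np.float64)
        features[:] = df.to_numpy()

        tasks = mp.Queue()
        results = mp.Queue()
        workers = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
        for p in workers:
            p.start()

        tasks.put([0, 5, 42])                  # row indices for one query
        print(results.get())

        for _ in workers:
            tasks.put(None)                    # tell every worker to exit
        for p in workers:
            p.join()
        sa.delete("features")                  # release the shared segment

Every worker sees the same memory, so nothing large is copied per request; the queues only carry the small lists of row indices and the results.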
Upvotes: 3