Reputation: 7054
We have a large pandas DataFrame (several GB) that is kept in memory. The application is a machine-learning web service that answers real-time queries.
Each query can contain up to 20,000 integers designating the rows of the DataFrame that the query relates to. (Different users have access to different items because of their geographic location.) The service then selects those rows and does some fancy machine learning on them to produce an answer for the user.
It works fairly well, but now we want to scale it further. Currently it can process only one request at a time (asynchronously). We could just run several copies, but that would mean duplicating the data, and we may not have that much memory. What would be a good solution?
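For reference, here is a stripped-down sketch of what each request does today (df, run_model and handle_query are placeholder names, not our real code):

    import numpy as np
    import pandas as pd

    # Toy stand-ins for the real data and model.
    df = pd.DataFrame({"feature": np.random.rand(1_000_000)})  # several GB in reality

    def run_model(subset):
        # Placeholder for the actual machine-learning step.
        return float(subset["feature"].mean())

    def handle_query(row_ids):
        # Select the (up to 20,000) rows this user may access, then run the model.
        subset = df.iloc[row_ids]
        return run_model(subset)

    print(handle_query([0, 5, 42]))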
Upvotes: 1
Views: 1028
Reputation: 526
Are you using the DataFrame itself in the ML code, or are you creating NumPy arrays from the DataFrame which are then passed to the ML code?
If the latter, and assuming the NumPy arrays can be created quickly compared with the time the ML code takes to run, I'd think the solution would be to have a single process read the DataFrame and create the NumPy arrays, then make those arrays available to the separate processes running the ML code via shared memory (see for instance the SharedArray package, https://pypi.python.org/pypi/SharedArray). You could then use multiprocessing.Queue or multiprocessing.Pipe to synchronize the DF and ML processes.
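A minimal sketch of that layout, assuming the ML step only needs a read-only NumPy feature matrix. The array contents and the per-query mean() are stand-ins for your real data and model; SharedArray's create/attach/delete and multiprocessing.Queue are used as in their documentation.

    import multiprocessing as mp

    import numpy as np
    import pandas as pd
    import SharedArray as sa

    SHM_NAME = "shm://features"  # shared-memory segment holding the feature matrix

    def worker(tasks, results):
        features = sa.attach(SHM_NAME)         # maps the same memory, no copy
        while True:
            row_ids = tasks.get()
            if row_ids is None:                # sentinel: shut down
                break
            subset = features[row_ids]         # only the rows this query may use
            results.put(float(subset.mean()))  # placeholder for the ML step

    if __name__ == "__main__":
        # Single "DF process": build the DataFrame once and publish its data
        # as a shared NumPy array that the workers can attach to by name.
        df = pd.DataFrame({"feature": np.random.rand(1_000_000)})
        features = sa.create(SHM_NAME, df.shape, dtype=np.float64)
        features[:] = df.to_numpy()

        tasks = mp.Queue()
        results = mp.Queue()
        workers = [mp.Process(target=worker, args=(tasks, results)) for _ in range(4)]
        for p in workers:
            p.start()

        tasks.put([0, 5, 42])                  # row indices for one query
        print(results.get())

        for _ in workers:
            tasks.put(None)                    # tell every worker to exit
        for p in workers:
            p.join()
        sa.delete("features")                  # release the shared segment

Every worker sees the same memory, so nothing large is copied per request; the queues only carry the small lists of row indices and the results.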
Upvotes: 3