Reputation: 123
I am trying to make use of multiprocessing in Python to process a very large dataframe. I have noticed that memory consumption keeps increasing when I run the code below. In the code, the dataframe (df) is shared across all processes, and each process uses it to extract a subDF from the large df based on some filters. The large df contains around 6 million records.
from functools import partial
from itertools import zip_longest
import multiprocessing
from multiprocessing import Manager, Pool

def prob(optimalK, mapping_dictionary, df, url):
    # each worker pulls the sub-frame for one URL out of the big df
    subDF = df.loc[df['URL'] == url]
    tmp_c = 0 * optimalK
    mapping_dictionary[url] = tmp_c

def grouper(n, iterable, padvalue=None):
    # split an iterable into chunks of n, padding the last chunk with padvalue
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

if __name__ == '__main__':
    .............
    uniqueURLs = df.URL.unique()
    manager = Manager()
    mapping_dictionary = manager.dict()
    numCores = multiprocessing.cpu_count()
    print(numCores)
    pool = Pool(processes=numCores)
    for chunk in grouper(1000, uniqueURLs):
        print("Start Processing 1000 .... ")
        func = partial(prob, optimalK, mapping_dictionary, df)
        pool.map(func, chunk)
        print("End Processing 1000.... ")
    pool.close()
    pool.join()
Something interesting I noticed is that what's really causing the memory consumption is this line in the prob function --> subDF = df.loc[df['URL'] == url]
I am not sure why .loc increases the memory consumption this much. Can someone suggest a more efficient way to achieve the same purpose as .loc, and how to make my code run faster?
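To illustrate what I think is happening on that line (this is just my reading of it, so please correct me if I am wrong): the comparison alone builds a boolean mask with one entry per row of df, and .loc then copies the matching rows, and this happens again for every single URL in every worker process.

# What subDF = df.loc[df['URL'] == url] does per call, as far as I understand:
mask = df['URL'] == url   # boolean Series with ~6 million entries, rebuilt on every call
subDF = df.loc[mask]      # a fresh copy of the matching rows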
Upvotes: 0
Views: 73
Reputation: 59416
Unfortunately, Python has the GIL problem (which you can Google). In short, it means that no two Python threads can process data structures at the same time. It simply makes implementing the Python interpreter way less complex and way more robust.
Because of this, the solution often is to use several processes instead of threads.
The drawback of using processes, however, is that they don't share memory (even though you described your df as shared). If they are supposed to work on the same data, they all need their own copy of it.
This means two things in particular: ① Memory gets filled up (as you noticed), and ② writing in one process doesn't change the data structure for another process.
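A tiny sketch to make point ② concrete (not your code, just an illustration; the list stands in for your dataframe): the child mutates only its own copy, and the parent never sees the change.

import multiprocessing

data = [0, 0, 0]   # stand-in for a big "shared"-looking structure

def mutate():
    # Runs in the child process and changes only the child's private copy.
    data[0] = 99

if __name__ == '__main__':
    p = multiprocessing.Process(target=mutate)
    p.start()
    p.join()
    print(data)    # still [0, 0, 0] in the parent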
So, normally I would propose switching to threads (which really do share data structures) to tackle the issue, but as stated at the beginning, because of the GIL, several threads normally do not speed things up in Python. They are rather used for staying responsive to several sources at once or for implementing algorithms that need concurrency.
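If you want to see that speed claim for yourself, here is a minimal CPU-bound sketch (the exact timings depend on your machine, but the two-thread version will not be anywhere near twice as fast):

import threading
import time

def count_down(n):
    # Pure-Python CPU work; the GIL lets only one thread execute it at a time.
    while n:
        n -= 1

N = 10_000_000

start = time.perf_counter()
count_down(N)
count_down(N)
print("sequential :", time.perf_counter() - start)

start = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two threads:", time.perf_counter() - start)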
So in your case, if you can, I propose to live with the memory consumption, or to switch to a different language, or to look for a Python module that does what you need out of the box (using threads or more clever subprocessing internally).
If, however, your job is taking long and needs speeding up because of lots of network traffic (sending out lots of queries to various servers), then using threads instead of subprocesses can be a perfect solution, because network traffic means a lot of waiting for the network, during which other threads can run perfectly well.
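In that case, something along these lines is usually enough (a sketch with placeholder URLs, not your data; the threads spend their time waiting on the network, and the GIL is released while they wait):

from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Placeholder endpoints -- substitute the queries you actually send.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

def fetch(url):
    # Blocking network I/O releases the GIL, so other threads keep running.
    with urllib.request.urlopen(url, timeout=10) as response:
        return url, len(response.read())

with ThreadPoolExecutor(max_workers=8) as executor:
    for url, size in executor.map(fetch, urls):
        print(url, size)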
Upvotes: 1