Reputation: 7676
I have a folder containing 7603 files saved with pickle.dump. The average file size is 6.5 MB, so the files take about 48 GB of disk space in total.
Each file is obtained by pickling a list object with the following structure:
[A * 50]
A = [str, int, [92 floats], B * 3]
B = [C * about 6]
C = [str, int, [92 floats]]
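For scale, here is a rough sketch (the string, int, and float values are made up for illustration) comparing the pickled size of one C-style element with a crude deep in-memory size computed from sys.getsizeof:

```python
import pickle
import sys

# Hypothetical reconstruction of one C element: [str, int, [92 floats]]
# (the actual contents are invented for illustration)
c = ["somehash", 42, [float(i) for i in range(92)]]

# sys.getsizeof only counts the outer list object, not what it refers to
shallow = sys.getsizeof(c)

# Crude deep size: outer list + its three elements + the 92 floats
deep = (sys.getsizeof(c)
        + sys.getsizeof(c[0])
        + sys.getsizeof(c[1])
        + sys.getsizeof(c[2])
        + sum(sys.getsizeof(x) for x in c[2]))

pickled = len(pickle.dumps(c))
print(shallow, deep, pickled)
```

Because sys.getsizeof only counts the container itself, the deep size has to be summed by hand; the in-memory total comes out several times larger than the pickled bytes.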
The computer I'm using has 128 GB of memory.
However, I cannot load all the files in the folder into memory with this script:
import pickle
import multiprocessing as mp
import sys
from os.path import join
from os import listdir
import os

def one_loader(the_arg):
    with open(the_arg, 'rb') as source:
        temp_fp = pickle.load(source)
    the_hash = the_arg.split('/')[-1]
    os.system('top -bn 1 | grep buff >> memory_log')
    return (the_hash, temp_fp)

def process_parallel(the_func, the_args):
    pool = mp.Pool(25)
    result = dict(pool.map(the_func, the_args))
    pool.close()
    return result
db_path = sys.argv[-1]  # path to the folder of pickle files
the_hashes = listdir(db_path)
the_files = [join(db_path, item) for item in the_hashes]
fp_dict = process_parallel(one_loader, the_files)
I have plotted the memory usage logged by the top call in the script. [Plot of memory usage over time; the image is not reproduced here.]
I have several confusions about this plot:
4000 files take 25 GB of disk space, so why do they take more than 100 GB of memory?
After the sudden drop in memory usage, I received no error, and I could see the script was still running with the top command. But I have no idea what the system was doing, or where the rest of the memory went.
Upvotes: 4
Views: 4213
Reputation: 140316
That is just because serialized data takes less space than the in-memory representation needed to manage the object at runtime.
Example with a string:
import pickle
with open("foo", "wb") as f:
    pickle.dump("toto", f)
foo
is 14 bytes on the disk (including pickle header or whatever) but in memory it's much bigger:
>>> import sys
>>> sys.getsizeof('toto')
53
For a dictionary it's even worse, because of the hash table (and other bookkeeping):
import pickle, os, sys

d = {"foo": "bar"}
with open("foo", "wb") as f:
    pickle.dump(d, f)
print(os.path.getsize("foo"))
print(sys.getsizeof(d))
result:
27
288
so a 1 to 10 ratio.
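The same comparison can be pushed through a nested structure like the one in the question. Here is a simplified sketch (the shapes and values are invented) that recursively sums sys.getsizeof over nested lists:

```python
import pickle
import sys

def total_size(obj):
    # Recursively sum sys.getsizeof over nested lists.
    # Simplified sketch: it handles lists only and ignores shared
    # references, so it can overcount -- enough to show the trend.
    size = sys.getsizeof(obj)
    if isinstance(obj, list):
        size += sum(total_size(item) for item in obj)
    return size

# A small invented nested list, loosely shaped like the B/C structures
# in the question: [str, int, [92 floats]] repeated a few times
nested = [["abc", 1, [float(i) for i in range(92)]] for _ in range(3)]

on_disk = len(pickle.dumps(nested))
in_memory = total_size(nested)
print(on_disk, in_memory)
```

Every level of nesting adds its own list header and per-reference overhead on top of the payload, which is why deeply nested lists of small objects show the worst disk-to-memory ratios.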
Upvotes: 4