meTchaikovsky

Reputation: 7676

Why does loading a pickle file into memory take much more space?

I have a folder containing 7603 files saved with pickle.dump. The average file size is 6.5MB, so the files take about 48GB of disk space in total.

Each file is obtained by pickling a list object with the following structure:

[A * 50] 
 A = [str, int, [92 floats], B * 3] 
                             B = [C * about 6] 
                                  C = [str, int, [92 floats]]
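
For concreteness, each file could be rebuilt with dummy values like this (a sketch only: the string lengths and integer ranges are arbitrary placeholders, and "B * 3" is expanded to three B entries at the end of A):

import random
import string

def rand_str(n=10):
    # stand-in for the real strings; length is arbitrary
    return ''.join(random.choices(string.ascii_lowercase, k=n))

def make_C():
    # C = [str, int, [92 floats]]
    return [rand_str(), random.randint(0, 100),
            [random.random() for _ in range(92)]]

def make_B():
    # B = [C * about 6]
    return [make_C() for _ in range(6)]

def make_A():
    # A = [str, int, [92 floats], B, B, B]
    return ([rand_str(), random.randint(0, 100),
             [random.random() for _ in range(92)]]
            + [make_B() for _ in range(3)])

one_file = [make_A() for _ in range(50)]  # the list pickled into each file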

The computer I'm using has 128GB of memory.

However, I cannot load all the files in the folder into memory with this script:

import pickle
import multiprocessing as mp
import sys
import os
from os import listdir
from os.path import join

def one_loader(the_arg):
    # unpickle one file; key it by its file name
    with open(the_arg, 'rb') as source:
        temp_fp = pickle.load(source)
    the_hash = the_arg.split('/')[-1]
    # log the current memory usage so it can be plotted later
    os.system('top -bn 1 | grep buff >> memory_log')
    return (the_hash, temp_fp)

def process_parallel(the_func, the_args):
    # load all files with a pool of 25 worker processes
    pool = mp.Pool(25)
    result = dict(pool.map(the_func, the_args))
    pool.close()
    pool.join()
    return result

db_path = sys.argv[-1]  # directory holding the pickled files
the_hashes = listdir(db_path)
the_files = [join(db_path, item) for item in the_hashes]
fp_dict = process_parallel(one_loader, the_files)

I have plotted the memory usage logged by the script:

[plot: memory usage while loading the files, climbing above 100GB and then dropping sharply]

I have two questions about this plot:

  1. 4000 files take 25GB of disk space, so why do they take more than 100GB of memory?

  2. After the sudden drop in memory usage, I received no error, and I could see with top that the script was still running. But I have no idea what the system was doing, or where the rest of the memory went.

Upvotes: 4

Views: 4213

Answers (1)

Jean-François Fabre

Reputation: 140316

That is simply because serialized data takes less space than the in-memory structures Python needs to manage the object at runtime.

Example with a string:

import pickle

with open("foo","wb") as f:
    pickle.dump("toto",f)

foo is 14 bytes on disk (pickle opcodes included), but in memory the string is much bigger:

>>> import sys
>>> sys.getsizeof('toto')
53
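
Most of that is CPython's fixed per-object overhead: on a 64-bit CPython 3, an empty string already costs 49 bytes, and each ASCII character adds only one byte on top (which is consistent with the 53 above):

>>> sys.getsizeof('')       # fixed overhead of a compact ASCII str object
49
>>> sys.getsizeof('totos')  # one extra byte per additional character
54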

for a dictionary it's even worse, because of the hash table (and other bookkeeping):

import pickle,os,sys

d = {"foo":"bar"}
with open("foo","wb") as f:
    pickle.dump(d,f)
print(os.path.getsize("foo"))
print(sys.getsizeof(d))

result:

27
288

so roughly a 1-to-10 ratio.
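
Note that sys.getsizeof is shallow: for the dictionary it counts the hash table but not the "foo" and "bar" strings inside it, so the real ratio is even a bit higher. To estimate the true footprint of a nested list structure like the one in the question, you need something recursive; here is a minimal sketch (it only handles the list/str/int/float types involved here, not a general-purpose tool):

import sys

def deep_getsizeof(obj, seen=None):
    # recursively sum sys.getsizeof over a nested structure,
    # counting each object only once even if it is shared
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, list):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

nested = ["toto", 42, [0.1 * i for i in range(92)]]
print(sys.getsizeof(nested))    # shallow: only the outer list
print(deep_getsizeof(nested))   # includes every float, str and int inside

For the 92-float lists in the question, the effect is easy to quantify: pickle stores a double in about 9 bytes, while on a 64-bit CPython each one costs 24 bytes as a float object plus an 8-byte slot in its list, roughly 3.5 times more before counting any other overhead.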

Upvotes: 4
