Slowpoke

Reputation: 1079

Python3: two dictionaries with numpy vectors of different size consume the same amount of RAM

I have two Python dictionaries of the form {word: np.array(float)}. The first dictionary holds 300-dimensional numpy vectors; the second (with the same keys) holds 150-dimensional ones. The pickle file of the first one is 4.3 GB, of the second one 2.2 GB.
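To give an idea of the structure, here is a hypothetical miniature version (made-up sizes; the real dictionaries hold about 3.7 million entries each):

import numpy as np

# toy stand-in for the real data: same structure, far fewer entries
rng = np.random.default_rng(0)
words = [f"word_{i}" for i in range(1000)]
big_dict = {w: rng.random(300) for w in words}    # 300-dimensional vectors
small_dict = {w: rng.random(150) for w in words}  # 150-dimensional vectors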

When I check the loaded objects with sys.getsizeof(), I get:

import sys
import pickle
import numpy as np

For the big dictionary:

with open("big.pickle", 'rb') as f:
    source = pickle.load(f)

sys.getsizeof(source)
#201326688

all(val.size==300 for key, val in source.items())
#True

The Linux top command shows 6.22 GB:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 4669 hcl       20   0 6933232 6,224g  15620 S   0,0 19,9   0:11.74 python3

For the small dictionary:

with open("small.pickle", 'rb') as f:
    source = pickle.load(f)

sys.getsizeof(source)
# 201326688  # Strange!

all(val.size==150 for key, val in source.items())
#True

But when I look at the python3 process with the Linux top command, I see 6.17 GB:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 4515 hcl       20   0 6875596 6,170g  16296 S   0,0 19,7   0:08.77 python3 

Both dictionaries were saved with pickle.HIGHEST_PROTOCOL in Python 3. I do not want to use JSON because of possible encoding errors and slow loading. Also, keeping numpy arrays is important for me, since I compute np.dot on these vectors.
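(The saving step itself looks roughly like this; the file name is just an example:)

import pickle

# sketch of how each dictionary was written; "big.pickle" is an example name
with open("big.pickle", "wb") as f:
    pickle.dump(source, f, protocol=pickle.HIGHEST_PROTOCOL)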

How can I reduce the RAM consumed by the dictionary with the smaller vectors?

More precise memory measurement:

# big:
sum(val.nbytes for key, val in source.items())
# 4456416000

# small:
sum(val.nbytes for key, val in source.items())
# 2228208000
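For a rough idea of where the rest of the RAM goes, the per-entry Python overhead can be estimated as well (only an estimate; the exact numbers depend on the Python and numpy builds):

import sys

# raw float data + per-array object overhead + key strings + the dict itself
raw = sum(val.nbytes for val in source.values())
arr_overhead = sum(sys.getsizeof(val) - val.nbytes for val in source.values())
key_bytes = sum(sys.getsizeof(key) for key in source.keys())
dict_bytes = sys.getsizeof(source)

print((raw + arr_overhead + key_bytes + dict_bytes) / 1024**3, "GiB, rough estimate")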

EDIT: Thanks to @etene's hint, I've managed to save and load my model using hdf5:

Saving:

import pickle
import numpy as np
import h5py


with open("reduced_150_normalized.pickle", 'rb') as f:
    source = pickle.load(f)

# list to save order
keys = []
values = []

for k, v in source.items():
    keys.append(k)
    values.append(v)

values = np.array(values)
print(values.shape)

with open('model150_keys.pickle',"wb") as f:
    pickle.dump(keys, f,protocol=pickle.HIGHEST_PROTOCOL) # do not store stings in h5! Everything will hang

h5f = h5py.File('model150_values.h5', 'w')
h5f.create_dataset('model_values', data=values)


h5f.close()

This produces a list of 3,713,680 key phrases and a values array with shape (3713680, 150).

Loading:

import pickle
import numpy as np
import h5py

with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # keys come from the separate pickle (storing strings in h5 hung)

# reconstruct the model by reading the h5 file row by row
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']

print(len(keys))
print(d.shape)

model = {}

for i, key in enumerate(keys):
    model[key] = np.array(d[i, :])  # one HDF5 read per row (slow, see below)

h5f.close()

Now the process indeed consumes only about 3 GB of RAM:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 5012 hcl       20   0 3564672 2,974g  17800 S   0,0  9,5   4:25.27 python3 

@etene, you can post your comment as an answer and I will accept it.

The only remaining problem is that loading now takes considerable time (5 min), probably because of the separate HDF5 lookup performed for each row of the array. If I could iterate over the hdf5 dataset more efficiently, without loading it all into RAM, that would be great.


EDIT2: Following @hpaulj's suggestion, I loaded the file in chunks, and loading is now as fast as pickle or even faster (4 s) with a chunk size of 10,000:

import pickle
import numpy as np
import h5py

with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # keys come from the separate pickle (storing strings in h5 hung)

# reconstruct the model by reading the h5 file in chunks
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']

print(len(keys))
print(d.shape)

model = {}

# load 10,000 rows at a time to speed up loading
for i, key in enumerate(keys):
    if i % 10000 == 0:
        data = d[i:i + 10000, :]  # one contiguous disk read per chunk

    model[key] = data[i % 10000, :]

h5f.close()

print(len(model))

Thanks, everyone!

Upvotes: 1

Views: 134

Answers (1)

etene

Reputation: 728

Summarizing what we found out in the comments:

  • sys.getsizeof returning the same value for two dicts with the same keys is normal behavior (see the small demo after this list). From the docs: "Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."
  • Deserializing all your data at once is what eats up so much RAM; this Numpy discussion thread mentions the HDF5 file format as a solution to read data in smaller batches, reducing memory usage.
  • However, reading in smaller batches can also hurt performance because of the extra disk I/O. Thanks to @hpaulj, @Slowpoke was able to determine a larger batch size that worked for him.
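A small demo of the first point, with toy sizes, just to show that sys.getsizeof only measures the dict's own hash table:

import sys
import numpy as np

# same keys, differently sized values: getsizeof reports the same number,
# because only the dict's hash table is counted, not the arrays it refers to
a = {str(i): np.zeros(300) for i in range(1000)}
b = {str(i): np.zeros(150) for i in range(1000)}
print(sys.getsizeof(a) == sys.getsizeof(b))  # True
print(sum(v.nbytes for v in a.values()))     # 2400000
print(sum(v.nbytes for v in b.values()))     # 1200000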

TL;DR for future readers: if your dataset is really large, don't deserialize it all at once; that can take unpredictable amounts of RAM. Use a specialized format such as HDF5 and read your data in reasonably sized batches, keeping in mind that smaller reads mean more disk I/O and larger reads mean more memory usage.
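A minimal sketch of that batched-reading pattern (file and dataset names taken from the question; the chunk size is only an example):

import h5py

CHUNK = 10_000  # larger chunks: fewer reads, more RAM per read

with h5py.File("model150_values.h5", "r") as h5f:
    dset = h5f["model_values"]
    for start in range(0, dset.shape[0], CHUNK):
        batch = dset[start:start + CHUNK, :]  # one contiguous disk read per chunk
        # ... process `batch` (a plain in-memory ndarray) here ...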

Upvotes: 1
