Slowpoke

Reputation: 1079

Python3: two dictionaries with numpy vectors of different size consume the same amount of RAM

I have two Python dictionaries of the form {word: np.array(float)}. The first dictionary holds 300-dimensional numpy vectors; the second (with the same keys) holds 150-dimensional ones. The pickle file of the first one is 4.3 GB, of the second one 2.2 GB.
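To give an idea of the structure, here is a hypothetical miniature version (made-up sizes; the real dictionaries hold about 3.7 million entries each):

import numpy as np

# toy stand-in for the real data: same structure, far fewer entries
rng = np.random.default_rng(0)
words = [f"word_{i}" for i in range(1000)]
big_dict = {w: rng.random(300) for w in words}    # 300-dimensional vectors
small_dict = {w: rng.random(150) for w in words}  # 150-dimensional vectors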

When I check the loaded objects with sys.getsizeof(), I get:

import sys
import pickle
import numpy as np

For the big dictionary:

with open("big.pickle", 'rb') as f:
    source = pickle.load(f)

sys.getsizeof(source)
#201326688

all(val.size==300 for key, val in source.items())
#True

The Linux top command shows 6.22 GB:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 4669 hcl       20   0 6933232 6,224g  15620 S   0,0 19,9   0:11.74 python3

For the small dictionary:

with open("small.pickle", 'rb') as f:
    source = pickle.load(f)

sys.getsizeof(source)
# 201326688  # Strange!

all(val.size==150 for key, val in source.items())
#True

But when I look at the python3 process with the Linux top command, I see 6.17 GB:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 4515 hcl       20   0 6875596 6,170g  16296 S   0,0 19,7   0:08.77 python3 

Both dictionaries were saved with pickle.HIGHEST_PROTOCOL in Python 3. I do not want to use JSON because of possible encoding errors and slow loading. Also, keeping numpy arrays is important for me, since I compute np.dot on these vectors.
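(The saving step itself looks roughly like this; the file name is just an example:)

import pickle

# sketch of how each dictionary was written; "big.pickle" is an example name
with open("big.pickle", "wb") as f:
    pickle.dump(source, f, protocol=pickle.HIGHEST_PROTOCOL)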

How can I reduce the RAM consumed by the dictionary with the smaller vectors?

More precise memory measurement:

# big:
sum(val.nbytes for key, val in source.items())
# 4456416000

# small:
sum(val.nbytes for key, val in source.items())
# 2228208000
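For a rough idea of where the rest of the RAM goes, the per-entry Python overhead can be estimated as well (only an estimate; the exact numbers depend on the Python and numpy builds):

import sys

# raw float data + per-array object overhead + key strings + the dict itself
raw = sum(val.nbytes for val in source.values())
arr_overhead = sum(sys.getsizeof(val) - val.nbytes for val in source.values())
key_bytes = sum(sys.getsizeof(key) for key in source.keys())
dict_bytes = sys.getsizeof(source)

print((raw + arr_overhead + key_bytes + dict_bytes) / 1024**3, "GiB, rough estimate")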

EDIT: Thanks to @etene's hint, I've managed to save and load my model using hdf5:

Saving:

import pickle
import numpy as np
import h5py


with open("reduced_150_normalized.pickle", 'rb') as f:
    source = pickle.load(f)

# list to save order
keys = []
values = []

for k, v in source.items():
    keys.append(k)
    values.append(v)

values = np.array(values)
print(values.shape)

with open('model150_keys.pickle',"wb") as f:
    pickle.dump(keys, f,protocol=pickle.HIGHEST_PROTOCOL) # do not store stings in h5! Everything will hang

h5f = h5py.File('model150_values.h5', 'w')
h5f.create_dataset('model_values', data=values)


h5f.close()

This produces a list of 3,713,680 key phrases and a values array with shape (3713680, 150).

Loading:

import pickle
import numpy as np
import h5py

with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # keys come from the separate pickle (storing strings in h5 hung)

# reconstruct the model by reading the h5 file row by row
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']

print(len(keys))
print(d.shape)

model = {}

for i, key in enumerate(keys):
    model[key] = np.array(d[i, :])  # one HDF5 read per row (slow, see below)

h5f.close()

Now the process indeed consumes only about 3 GB of RAM:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 5012 hcl       20   0 3564672 2,974g  17800 S   0,0  9,5   4:25.27 python3 

@etene, you can post your comment as an answer and I will accept it.

The only remaining problem is that loading now takes considerable time (5 min), probably because of the separate HDF5 lookup performed for each row of the array. If I could iterate over the hdf5 dataset more efficiently, without loading it all into RAM, that would be great.


EDIT2: Following @hpaulj's suggestion, I loaded the file in chunks, and loading is now as fast as pickle or even faster (4 s) with a chunk size of 10,000:

import pickle
import numpy as np
import h5py

with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # keys come from the separate pickle (storing strings in h5 hung)

# reconstruct the model by reading the h5 file in chunks
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']

print(len(keys))
print(d.shape)

model = {}

# load 10,000 rows at a time to speed up loading
for i, key in enumerate(keys):
    if i % 10000 == 0:
        data = d[i:i + 10000, :]  # one contiguous disk read per chunk

    model[key] = data[i % 10000, :]

h5f.close()

print(len(model))

Thanks, everyone!

Upvotes: 1

Views: 134

Answers (1)

etene

Reputation: 728

Summarizing what we found out in the comments:

  • sys.getsizeof returning the same value for two dicts with the same keys is normal behavior (see the small demo after this list). From the docs: "Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."
  • Deserializing all your data at once is what eats up so much RAM; this Numpy discussion thread mentions the HDF5 file format as a solution to read data in smaller batches, reducing memory usage.
  • However, reading in smaller batches can also hurt performance because of the extra disk I/O. Thanks to @hpaulj, @Slowpoke was able to determine a larger batch size that worked for him.
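A small demo of the first point, with toy sizes, just to show that sys.getsizeof only measures the dict's own hash table:

import sys
import numpy as np

# same keys, differently sized values: getsizeof reports the same number,
# because only the dict's hash table is counted, not the arrays it refers to
a = {str(i): np.zeros(300) for i in range(1000)}
b = {str(i): np.zeros(150) for i in range(1000)}
print(sys.getsizeof(a) == sys.getsizeof(b))  # True
print(sum(v.nbytes for v in a.values()))     # 2400000
print(sum(v.nbytes for v in b.values()))     # 1200000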

TL;DR for future readers: if your dataset is really large, don't deserialize it all at once; that can take unpredictable amounts of RAM. Use a specialized format such as HDF5 and read your data in reasonably sized batches, keeping in mind that smaller reads mean more disk I/O and larger reads mean more memory usage.
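A minimal sketch of that batched-reading pattern (file and dataset names taken from the question; the chunk size is only an example):

import h5py

CHUNK = 10_000  # larger chunks: fewer reads, more RAM per read

with h5py.File("model150_values.h5", "r") as h5f:
    dset = h5f["model_values"]
    for start in range(0, dset.shape[0], CHUNK):
        batch = dset[start:start + CHUNK, :]  # one contiguous disk read per chunk
        # ... process `batch` (a plain in-memory ndarray) here ...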

Upvotes: 1
