Reputation: 1079
I have two Python dictionaries of the form {word: np.array(float)}. In the first dictionary the numpy vectors are 300-dimensional; in the second (the keys are the same) they are 150-dimensional. The pickled file size of the first one is 4.3 GB, of the second one 2.2 GB.
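For concreteness, a toy version of this structure (made-up words and random vectors, not my actual data) looks like:

import numpy as np

# toy illustration only: made-up words, random vectors
words = ["apple", "banana", "cherry"]
big_model = {w: np.random.rand(300) for w in words}    # 300-dimensional
small_model = {w: np.random.rand(150) for w in words}  # 150-dimensional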
When I check the loaded objects with sys.getsizeof() I get:
import sys
import pickle
import numpy as np
For the big dictionary:

with open("big.pickle", 'rb') as f:
    source = pickle.load(f)

sys.getsizeof(source)
# 201326688

all(val.size == 300 for key, val in source.items())
# True
The Linux top command shows 6.22 GB:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4669 hcl 20 0 6933232 6,224g 15620 S 0,0 19,9 0:11.74 python3
For the small dictionary:

with open("small.pickle", 'rb') as f:
    source = pickle.load(f)

sys.getsizeof(source)
# 201326688  # Strange!

all(val.size == 150 for key, val in source.items())
# True
But when I look at the python3 process with the Linux top command, I see 6.17 GB:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4515 hcl 20 0 6875596 6,170g 16296 S 0,0 19,7 0:08.77 python3
Both dictionaries were saved with pickle.HIGHEST_PROTOCOL under Python 3. I do not want to use JSON because of possible encoding errors and slow loading. Also, using numpy arrays is important for me, because I compute np.dot
on these vectors.
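For context, the kind of lookup these dot products feed is roughly this (a hypothetical most_similar helper, not my exact code, assuming the vectors are L2-normalized so the dot product is cosine similarity):

import numpy as np

def most_similar(query_vec, model, topn=5):
    # rank keys by dot product with the query vector;
    # with L2-normalized vectors this is cosine similarity
    scores = ((key, float(np.dot(query_vec, vec))) for key, vec in model.items())
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:topn]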
How can I shrink the RAM usage of the dictionary that holds the smaller vectors?
A more precise measurement of the vector data itself:

# big:
sum(val.nbytes for key, val in source.items())
# 4456416000

# small:
sum(val.nbytes for key, val in source.items())
# 2228208000
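Note that sys.getsizeof(source) only counts the dict's own hash table, not the key strings or the arrays it references, which is why both dicts report the same number. A rough deeper estimate (a sketch; it still won't match top's RSS because of allocator overhead and fragmentation) would be:

import sys

deep_estimate = (
    sys.getsizeof(source)
    + sum(sys.getsizeof(k) for k in source)           # key strings
    + sum(sys.getsizeof(v) for v in source.values())  # ndarray objects incl. their buffers
)
print(deep_estimate)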
EDIT: Thanks to @etene's hint, I've managed to save and load my model using hdf5:
Saving:
import pickle
import numpy as np
import h5py
with open("reduced_150_normalized.pickle", 'rb') as f:
source = pickle.load(f)
# list to save order
keys = []
values = []
for k, v in source.items():
keys.append(k)
values.append(v)
values = np.array(values)
print(values.shape)
with open('model150_keys.pickle',"wb") as f:
pickle.dump(keys, f,protocol=pickle.HIGHEST_PROTOCOL) # do not store stings in h5! Everything will hang
h5f = h5py.File('model150_values.h5', 'w')
h5f.create_dataset('model_values', data=values)
h5f.close()
This produces a keys list of length 3713680 and a values array of shape (3713680, 150).
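A possible refinement of the saving step (a sketch, not what I actually ran): preallocate the matrix with an explicit dtype instead of stacking a temporary Python list, and free each source vector as it is copied, which keeps peak memory lower. This assumes every vector has the same length and that float32 precision is acceptable:

import pickle
import numpy as np
import h5py

with open("reduced_150_normalized.pickle", 'rb') as f:
    source = pickle.load(f)

keys = list(source.keys())
dim = len(next(iter(source.values())))
# assumption: all vectors share this length, and float32 is precise enough
values = np.empty((len(keys), dim), dtype=np.float32)
for i, k in enumerate(keys):
    values[i, :] = source.pop(k)  # frees the original vector as we go (destroys source)

with open('model150_keys.pickle', "wb") as f:
    pickle.dump(keys, f, protocol=pickle.HIGHEST_PROTOCOL)
with h5py.File('model150_values.h5', 'w') as h5f:
    h5f.create_dataset('model_values', data=values)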
Loading:
import pickle
import numpy as np
import h5py
with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # do not store strings in h5! Everything will hang

# we will construct the model by reading the h5 file row by row
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']
print(len(keys))
print(d.shape)

model = {}
for i, key in enumerate(keys):
    model[key] = np.array(d[i, :])
h5f.close()
Now only about 3 GB of RAM is consumed:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5012 hcl 20 0 3564672 2,974g 17800 S 0,0 9,5 4:25.27 python3
@etene, you can post your comment as an answer and I will accept it.
The only problem left is that loading now takes considerable time (5 min), perhaps because of the lookup performed in the HDF5 file for each position in the numpy array. If I could somehow iterate over the HDF5 file by the second coordinate without loading everything into RAM, that would be great.
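For comparison, an alternative to per-row reads would be a single bulk read, building the dict from row views so the vector data is stored only once in the big matrix (a sketch, not what I benchmarked):

import pickle
import h5py

with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)

with h5py.File('model150_values.h5', 'r') as h5f:
    values = h5f['model_values'][:]  # one bulk read into a single ndarray

# each entry is a view into the big matrix, not an independent copy
model = {key: values[i] for i, key in enumerate(keys)}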
EDIT2: Following @hpaulj's suggestion, I loaded the file in chunks, and it is now as fast as pickle or even faster (4 s) when a chunk size of 10k rows is used:
import pickle
import numpy as np
import h5py
with open('model150_keys.pickle', "rb") as f:
    keys = pickle.load(f)  # do not store strings in h5! Everything will hang

# we will construct the model by reading the h5 file in chunks
h5f = h5py.File('model150_values.h5', 'r')
d = h5f['model_values']
print(len(keys))
print(d.shape)

model = {}
# load in chunks of 10000 rows to speed up loading
for i, key in enumerate(keys):
    if i % 10000 == 0:
        data = d[i:i + 10000, :]
    model[key] = data[i % 10000, :]
h5f.close()
print(len(model))
Thanks, everyone!
Upvotes: 1
Views: 134
Reputation: 728
Summarizing what we found out in the comments:
- sys.getsizeof() reporting the same size for two dicts with the same keys is normal behavior. From the docs: "Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."
- TL;DR for future readers: if your dataset is really large, don't deserialize it all at once; that can take unpredictable amounts of RAM. Use a specialized format such as HDF5 and chop your data into reasonably-sized batches, keeping in mind that smaller reads mean more disk I/O and larger reads mean more memory usage.
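A generic sketch of that batched-read pattern (the file name, dataset name, and batch size are placeholders to adapt):

import h5py

def iter_rows(h5_path, dataset_name, batch_size=10000):
    # yield rows one at a time while reading the file in contiguous batches;
    # smaller batch_size = more disk i/o, larger batch_size = more RAM per read
    with h5py.File(h5_path, 'r') as h5f:
        dset = h5f[dataset_name]
        for start in range(0, dset.shape[0], batch_size):
            block = dset[start:start + batch_size, :]
            for row in block:
                yield row

# usage with the files from the question:
# for vec in iter_rows('model150_values.h5', 'model_values'):
#     ...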
Upvotes: 1