Reputation: 336
I have an h5 file, which is basically model weights output by Keras. For storage reasons, I'd like to split the large h5 file into smaller pieces and combine them back into a single file when needed. However, the way I do it seems to miss some "metadata" (I'm not sure; maybe it's missing a lot more, but judging by the sizes of the combined file and the original file, it doesn't seem that much is missing).
Here's my splitting script:
prefix = "model_weights"
fname_src = "DiffusiveSizeFactorAI/model_weights.h5"
size_max = 90 * 1024**2 # maximum size allowed in bytes
is_file_open = False
dest_fnames = []
idx = 0
with h5py.File(fname_src, "r") as src:
for group in src:
fname = f"{prefix}_{idx}.h5"
if not is_file_open:
dest = h5py.File(fname, "w")
dest_fnames.append(fname)
is_file_open = True
group_id = dest.require_group(group)
src.copy(f"/{group}", group_id)
size = os.path.getsize(fname)
if size > size_max:
dest.close()
idx += 1
is_file_open = False
dest.close()
and here's the script that I use for combining back the pieces:
fname_combined = f"{prefix}_combined.h5"
with h5py.File(fname_combined, "w") as combined:
for fname in dest_fnames:
with h5py.File(fname, "r") as src:
for group in src:
group_id = combined.require_group(group)
src.copy(f"/{group}", group_id)
To add a little context in case it helps with debugging: when I load the "combined" model weights, here's the error I get:
ValueError: Layer count mismatch when loading weights from file. Model expected 108 layers, found 0 saved layers.
Note: the original file and the combined one are about the same size (they differ by less than 0.5%), which is why I think I'm only missing some metadata.
Upvotes: 1
Views: 2225
Reputation: 336
Based on an answer from the h5py developers, there are two issues (both fixes appear in the sketch after this list):
1. Duplicated group level: when an h5 file is copied this way, an extra duplicate group level is added to the destination file. Let me explain: suppose src.h5 has the structure /A/B/C. In these two lines:
group_id = dest.require_group(group)
src.copy(f"/{group}", group_id)
group is /A, and so, after copying, an extra /A is added to dest.h5, which results in the erroneous structure /A/A/B/C. To fix that, one needs to explicitly pass name="A" as an argument to copy.
2. Missing attributes: the file-level attributes, which hold the metadata Keras needs to load the weights, are not copied. Since the h5 data structure is very similar to Python's dict, you just need to add:
dest.attrs.update(src.attrs)
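Putting both fixes together, here is a minimal sketch of a corrected combining loop (an illustration, not the exact code from my repo; it assumes each split piece was written with the source file's attributes copied into it, as in the split sketch further below):
import h5py

fname_combined = f"{prefix}_combined.h5"
with h5py.File(fname_combined, "w") as combined:
    for fname in dest_fnames:
        with h5py.File(fname, "r") as src:
            # Fix 2: carry over the file-level attributes (the Keras
            # metadata) from each piece.
            combined.attrs.update(src.attrs)
            for group in src:
                # Fix 1: pass name= explicitly so the copied group lands
                # at /A rather than nested one level deeper at /A/A.
                src.copy(f"/{group}", combined, name=group)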
For personal use, I've written two helper functions: one that splits up a large h5 file into smaller parts, each not exceeding a specified size (passed as an argument by the user), and another that combines them back into a single h5 file. In case you find it useful, it can be found on GitHub here. A sketch of the split side with the same fixes follows below.
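A minimal sketch of the split side with the two fixes applied (again an illustration under the same assumptions, not the exact repo code):
import os
import h5py

prefix = "model_weights"
fname_src = "DiffusiveSizeFactorAI/model_weights.h5"
size_max = 90 * 1024**2  # maximum size per piece, in bytes
dest_fnames = []
idx = 0
dest = None

with h5py.File(fname_src, "r") as src:
    for group in src:
        if dest is None:
            fname = f"{prefix}_{idx}.h5"
            dest = h5py.File(fname, "w")
            # Each piece carries the source file's attributes, so the
            # combining step can restore them.
            dest.attrs.update(src.attrs)
            dest_fnames.append(fname)
        # Copy under an explicit name to avoid the duplicated group level.
        src.copy(f"/{group}", dest, name=group)
        dest.flush()  # so getsize reflects what has been written
        if os.path.getsize(fname) > size_max:
            dest.close()
            dest = None
            idx += 1
if dest is not None:
    dest.close()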
Upvotes: 1
Reputation: 46
I am wondering if there is an alternative solution to your problem. I am assuming you want to deploy the model on an embedded system, which imposes memory restrictions. If that is the case, here are some alternatives:
Use TensorFlow Lite: it is claimed to significantly reduce model size (I haven't really tested this), and it also improves other important aspects of deploying ML on the edge. In short, it can make the model up to 5x smaller (a conversion sketch follows after this list).
Apply pruning: pruning gradually zeroes out model weights during training to achieve model sparsity. Sparse models are easier to compress, and the zeroed weights can be skipped during inference for latency improvements.
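For the TensorFlow Lite route, here is a minimal conversion sketch (assuming TensorFlow 2.x; "model.h5" is a placeholder for a full saved model, not just weights):
import tensorflow as tf

# Load the full Keras model (architecture + weights).
model = tf.keras.models.load_model("model.h5")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Optional: default optimizations (mainly weight quantization) account
# for most of the size reduction.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)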
Upvotes: 1