Basj

Reputation: 46503

Modify only a few bytes in an .npz NumPy file without rewriting the whole file

The following code writes and loads a NumPy array plus metadata in a compressed .npz file (compression is useless here because the data is random, but that is beside the point):

import numpy as np

# save
D = {"x": np.random.random((10000, 1000)), "metadata": {"date": "20221123", "user": "bob", "name": "abc"}}
with open("test.npz", "wb") as f:
    np.savez_compressed(f, **D)

# load
D2 = np.load("test.npz", allow_pickle=True)
print(D2["x"])
print(D2["metadata"].item()["date"])

Let's say we want to change only one metadata field:

D["metadata"]["name"] = "xyz"

Is there a way to rewrite only D["metadata"] in test.npz on disk, and not the whole file, given that D["x"] has not changed?

In my case, the .npz file can be anywhere from 100 MB to 4 GB, which is why rewriting only the metadata would be useful.

Upvotes: 3

Views: 448

Answers (1)

Mercury

Reputation: 4171

Ultimately the solution that I could get to work (thus far) is the one I originally thought of with zipfile.

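import os
import zipfile
from contextlib import contextmanager

import numpy as np


@contextmanager
def archive_manager(archive_name: str, key: str):
    # Name of the temporary .npy file; it is also the member name inside the archive.
    member = f"{key}.npy"
    try:
        # The caller saves the new array to this path.
        yield member
        # Append mode adds the member at the end of the archive without rewriting
        # the others. The old entry with the same name stays in the file.
        with zipfile.ZipFile(archive_name, "a") as zf:
            zf.write(member)
    finally:
        # Remove the temporary file even if saving or appending failed.
        if os.path.exists(member):
            os.remove(member)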

Let's say we want to change the metadata:

new_metadata = {"date": "20221123", "user": "bob", "name": "xyz"}

with archive_manager("test.npz", "metadata") as archive:
    np.save(archive, new_metadata)
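
To check that the append approach behaves as intended, here is a minimal self-contained sketch on a small array. It is a variant, not the exact code above: the hypothetical helper append_member serializes to an in-memory buffer and uses ZipFile.writestr instead of a temporary file, and the choice of ZIP_DEFLATED is my assumption. As far as I can tell, zipfile resolves a duplicated member name to the most recently added entry, which is why np.load returns the new metadata.

```python
import io
import os
import tempfile
import warnings
import zipfile

import numpy as np


def append_member(archive_name, key, value):
    """Append `value` to an .npz archive as `<key>.npy` (hypothetical helper)."""
    buf = io.BytesIO()
    np.save(buf, value, allow_pickle=True)
    with zipfile.ZipFile(archive_name, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        with warnings.catch_warnings():
            # zipfile warns about the duplicate member name; expected here.
            warnings.simplefilter("ignore", UserWarning)
            zf.writestr(f"{key}.npy", buf.getvalue())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.npz")
        x = np.arange(12.0).reshape(3, 4)
        np.savez_compressed(path, x=x, metadata={"name": "abc"})

        # Only the new metadata entry is written; x.npy is left untouched.
        append_member(path, "metadata", {"name": "xyz"})

        with np.load(path, allow_pickle=True) as data:
            name = data["metadata"].item()["name"]
            x_unchanged = np.array_equal(data["x"], x)
        with zipfile.ZipFile(path) as zf:
            names = zf.namelist()
    return name, x_unchanged, names


name, x_unchanged, names = demo()
print(name, x_unchanged, names)
```

Note that the old metadata entry is still physically in the archive, so the file grows slightly with every update. For a small metadata dict next to a multi-gigabyte array this is usually negligible, but the archive would need a full rewrite to reclaim that space.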

np.load returns an NpzFile, which is a lazy loader: each array is read from the archive only when its key is accessed. However, NpzFile objects aren't directly writable. We also cannot do something like D["metadata"] = new_metadata until D has been converted to a dict, and that loses the lazy loading.
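
The behaviour described above can be seen in a short sketch. The function name demo_lazy and the small array are placeholders of mine, and I am assuming that item assignment on an NpzFile fails with a TypeError, as it does for any object without __setitem__.

```python
import os
import tempfile

import numpy as np


def demo_lazy():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.npz")
        np.savez_compressed(path, x=np.zeros((100, 10)), metadata={"name": "abc"})

        with np.load(path, allow_pickle=True) as data:
            kind = type(data).__name__        # an NpzFile, not a dict
            keys = sorted(data.files)         # member names, nothing decoded yet
            meta = data["metadata"].item()    # only this member is read
            try:
                data["metadata"] = {"name": "xyz"}
                assignable = True
            except TypeError:
                assignable = False
            # Converting to a dict reads every member, including the large array.
            eager = {k: data[k] for k in data.files}
    return kind, keys, meta, assignable, sorted(eager)


print(demo_lazy())
```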

Upvotes: 2
