Mark

Reputation: 1309

Python h5py Virtual Dataset - Concatenate/Append, not stack

I've recently started using Virtual Datasets (VDS) in Python with h5py. It all seems fairly straightforward, and it avoids duplicating data and growing the file size as a result.

Most of the examples I've seen are like the one below.

import h5py

layout = h5py.VirtualLayout(shape=(4, 100), dtype='i4')

for n in range(1, 5):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, 'data', shape=(100,))
    layout[n - 1] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=-5)

They tend to take several data sources (in this case from separate HDF5 files) and create a single VDS in which the data is 'stacked' together. By this, I mean it takes four arrays, each of shape (100,), and creates a single VDS of shape (4, 100).
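In NumPy terms, the difference is roughly the one between np.stack and np.concatenate. The snippet below is only an in-memory analogy with made-up sample arrays (the name `parts` is hypothetical); it does not create a VDS.

```python
import numpy as np

# Four sample arrays of shape (100,), standing in for the four source datasets
parts = [np.arange(100, dtype="i4") + 100 * n for n in range(4)]

stacked = np.stack(parts)        # what the example above produces: a new leading axis
joined = np.concatenate(parts)   # what I am after: one long 1-D array

print(stacked.shape)  # (4, 100)
print(joined.shape)   # (400,)
```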

I'm looking to create a VDS of shape (400,), essentially concatenating the four (100,) arrays end to end in a single VDS. How do I do this?

Upvotes: 1

Views: 457

Answers (1)

kcw78

Reputation: 8006

Here you go: four files, each with a dataset of shape (100,), combined into a single virtual dataset of shape (400,). The trick is to use slice notation when you map each virtual source into the virtual layout, so that source n fills elements n*100 up to (but not including) (n+1)*100, as done on this line: layout[n*100:(n+1)*100] = vsource

import h5py
import numpy as np

# Create source files (0.h5 to 3.h5)
a0 = 4  # number of source files
for n in range(a0):
    # Create some sample data: 100 values per file, multiples of (n+1)
    arr = (n+1)*np.arange(1,101)
    with h5py.File(f"{n}.h5", "w") as f:
        f.create_dataset("data", data=arr)

# Assemble virtual datasets
layout = h5py.VirtualLayout(shape=(a0*100,), dtype="i4")
for n in range(a0):
    vsource = h5py.VirtualSource(f"{n}.h5", "data", shape=(100,))
    layout[n*100:(n+1)*100] = vsource

# Add virtual dataset to VDS file
with h5py.File("VDS.h5", "w") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-1)

# Read the data back
# The virtual dataset is transparent to the reader: it behaves like a regular dataset
with h5py.File("VDS.h5", "r") as f:
    print("\nVDS Shape: ", f["vdata"].shape)
    print("\nFirst 10 Elements of Virtual dataset:")
    print(f["vdata"][:10])
    print("Last 10 Elements of Virtual dataset:")
    print(f["vdata"][-10:])
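If the source datasets are not all the same length, the same slice mapping should still work with a running offset in place of n*100. The following is a minimal sketch under that assumption, not something from the question: the helper name `concat_vds`, the temporary directory, and the sample sizes are all hypothetical, and it assumes every source is 1-D with a compatible dtype.

```python
import os
import tempfile

import h5py
import numpy as np


def concat_vds(source_files, dset_name, vds_file, vds_name="vdata"):
    """Map 1-D datasets from several files end to end into one virtual dataset.

    Hypothetical helper for illustration; assumes all sources are 1-D.
    """
    # First pass: read each source's length (and the first dtype) without loading data
    lengths = []
    dtype = None
    for path in source_files:
        with h5py.File(path, "r") as f:
            lengths.append(f[dset_name].shape[0])
            if dtype is None:
                dtype = f[dset_name].dtype

    # Second pass: assign each source to its own slice of the layout
    layout = h5py.VirtualLayout(shape=(sum(lengths),), dtype=dtype)
    offset = 0
    for path, length in zip(source_files, lengths):
        layout[offset:offset + length] = h5py.VirtualSource(
            path, dset_name, shape=(length,)
        )
        offset += length

    with h5py.File(vds_file, "w") as f:
        f.create_virtual_dataset(vds_name, layout, fillvalue=-1)


# Demo with three sample files of different lengths (made-up data)
tmpdir = tempfile.mkdtemp()
sizes = [100, 50, 25]
paths = []
for n, size in enumerate(sizes):
    path = os.path.join(tmpdir, f"{n}.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.arange(size, dtype="i4") + 1000 * n)
    paths.append(path)

vds_path = os.path.join(tmpdir, "VDS.h5")
concat_vds(paths, "data", vds_path)

with h5py.File(vds_path, "r") as f:
    combined = f["vdata"][:]

print(combined.shape)  # (175,)
```

Absolute source paths are used here only to keep the demo independent of the working directory; with relative names like those in the example above, the source files need to stay where the VDS file can find them.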

Upvotes: 2
