Reputation: 7476
Let's say I have an HDF5 dataset with maxshape=(None,1000) and chunks=(1,1000).
Whenever I need to delete a row, I just zero it out (this happens for many rows):
ds[ix,:] = 0
What is the fastest way to vacuum the zeroed rows and resize the array?
Now let's add a twist. I have a dict that resolves symbols to dataset row indices:
{ name : ds_ix }
What is the fastest way to vacuum and keep each ds_ix correct?
Upvotes: 0
Views: 403
Reputation: 8006
Did you mean resize the dataset when you asked 'resize the array'? (Also, I assume you meant maxshape=(None,1000).) If so, use the .resize() method. However, if you aren't removing the last row(s), you will have to rearrange the non-zero data first, then resize. (And you really don't need to zero out the row(s), since you are going to overwrite them.)
I can think of 2 approaches to rearrange the data: 1) use slice notation to define FROM and TO indices, or 2) read the dataset into a numpy array, delete the rows, and copy it back. Both involve disk I/O so it's not clear which would be faster without testing. It probably doesn't matter for small datasets and only a few deleted rows. I suspect the second method will be better if you plan to delete a lot of rows from large datasets. However, benchmark tests are required to confirm.
Note: be careful when setting the chunk size. Remember that it controls the I/O size, and you will be doing a lot of I/O when you move rows. Setting it too small (or too large) can degrade performance. (1,1000) is probably too small. The recommended chunk size is 10 KiB to 1 MiB; a (1,1000) float32 chunk is only 4,000 bytes (about 4 KiB).
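To make the chunk-size arithmetic concrete, here is a minimal sketch that computes the size of one chunk in bytes. The helper name chunk_nbytes and the (64,1000) shape are only illustrative assumptions, not a recommendation for your data; pick the row count based on how you actually read and write.

```python
import numpy as np

def chunk_nbytes(chunks, dtype):
    """Size in bytes of one chunk with the given shape and dtype (hypothetical helper)."""
    return int(np.prod(chunks)) * np.dtype(dtype).itemsize

# The chunk shape from the question: one row of 1000 float32 values
print(chunk_nbytes((1, 1000), 'float32'))    # 4000 bytes, below the 10 KiB guideline

# An illustrative alternative: 64 rows per chunk
print(chunk_nbytes((64, 1000), 'float32'))   # 256000 bytes, inside the 10 KiB to 1 MiB range
```

A shape like this would be passed as chunks=(64,1000) when the dataset is created.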
Here are both approaches with a very small dataset.
Create an HDF5 file:
import h5py
import numpy as np

with h5py.File('SO_73353006.h5','w') as h5f:
    a0, a1 = 10, 5
    arr = np.arange(a0*a1).reshape(a0,a1)
    # maxshape=(None,a1) makes the dataset resizable along axis 0
    ds = h5f.create_dataset('test',data=arr,maxshape=(None,a1))
Method 1: move data, then resize dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    #ds[idx,:] = 0 # Not required since we will overwrite the row
    a0 = ds.shape[0]
    # shift every row after idx up by one, then drop the last row
    ds[idx:a0-1] = ds[idx+1:a0]
    ds.resize(a0-1,axis=0)
Method 2: extract array, delete row and copy data to resized dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    a0 = ds.shape[0]
    a1 = ds.shape[1]
    # read dataset into array and delete row
    ds_arr = ds[()]
    ds_arr = np.delete(ds_arr, obj=idx, axis=0)
    # Resize dataset and load array
    ds.resize(a0-1,axis=0) # same as above
    ds[:] = ds_arr[:]
    # Create a new dataset for comparison
    ds2 = h5f.create_dataset('test2',data=ds_arr,maxshape=(None,a1))
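For the second part of the question (many zeroed rows plus the { name : ds_ix } dict), here is a rough sketch that extends Method 2. The function names vacuum and remap_indices are hypothetical. It assumes that an all-zero row always means "deleted", that the whole dataset fits in memory, and that names pointing at deleted rows should be dropped from the dict. I have not benchmarked it.

```python
import numpy as np

def remap_indices(keep, symbols):
    """Return {name: new_row} for surviving rows.

    keep is a boolean mask over the old rows; symbols maps name -> old row.
    """
    # After compaction, a kept row lands at (number of kept rows up to it) - 1
    new_ix = np.cumsum(keep) - 1
    return {name: int(new_ix[ix]) for name, ix in symbols.items() if keep[ix]}

def vacuum(ds, symbols):
    """Drop all-zero rows from a resizable 2-D h5py dataset and remap symbols.

    Assumption: a row that is entirely zero has been deleted.
    """
    data = ds[()]                    # read the whole dataset into memory
    keep = data.any(axis=1)          # True where a row has a non-zero value
    n_keep = int(keep.sum())
    if n_keep:
        ds[:n_keep] = data[keep]     # write surviving rows back, packed at the top
    ds.resize(n_keep, axis=0)        # truncate the now-unused tail
    return remap_indices(keep, symbols)
```

Usage would look like symbols = vacuum(h5f['test'], symbols) inside a with h5py.File(..., 'r+') block. The dict is rebuilt in one pass from a cumulative sum over the keep mask, so the cost of the remap is small compared to the disk I/O.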
Upvotes: 1