Reputation: 41
When calling another libary to dask such as scikit image contrast stretch, I realise that dask is creating a result for each block, storing in either memory or spilling to disk seperately. Then it attempts to merge all the results. Thats fine if your on a cluster or on a single computer and the dataset for the array is small, everything is fairly controlled. The problems start to happen when you work with data sets that are much larger than your RAM or disk. Is there a way to mitigate this or use the zarr file format to save to updating values as you go along? May be thats too fanciful. Any other ideas bar buy more ram would be helpful.
edit
I was looking at the documentation on dask and the suggestions on chunk sizes for dask, is something like about 100MB. I ended up reducing significantly from this amount to 30-70MB depending on file size. I then ran a contrast stretch (not from a library but with numpy unfunc and I didnt have any issue! In fact i played with the way the compuation is done. Since I start with a uint8 3dim array, when multiplying by the ratio for contrast stretch I am inevitably increasing the array chunk to a float64 array. Which takes up significant memory and computation. So what I have been do is treating the da.array as np.asarray(float64) but only prior to the multiplication by a float number. Then returning to a uint8 to finish the computation. The stretch time has reduced to just under 5 mins for a 20GB file. So I think thats a positive step. Just means image processing without libraries, I will, have a look at rechunker though.
The image processing pipeline i am building is to inevitable be used for a merged dataset of about 250-300GB (definitely outside the limits of my laptop). I also dotn have time to get to grips with cloud or parralell processing in the cloud. Thats for a few months down the line. Right now its trying to get through this analysis.
Upvotes: 0
Views: 60
Reputation: 28684
Yes, you can do the kind of thing you are talking about. I encourage you to check out the rechunker project, which is specialied around changing the layout of the data in zarr storage, but shows the idea of how to save temporary intermediated for the purpose of mitigating memory and communication issues.
Upvotes: 1