Reputation: 723
So I have 3 netCDF4 files (each approx. 90 MB) that I would like to concatenate using the xarray package. Each file has one variable (dis) represented at 0.5 degree resolution (lat, lon) for 365 days (time). My aim is to concatenate the three files so that we have a timeseries of 1095 days (3 years).
Each file (for years 2007, 2008, 2009) has:

1 variable: dis
3 coordinates: time, lat, lon

... as such:
<xarray.Dataset>
Dimensions: (lat: 360, lon: 720, time: 365)
Coordinates:
* lon (lon) float32 -179.75 -179.25 -178.75 -178.25 -177.75 -177.25 ...
* lat (lat) float32 89.75 89.25 88.75 88.25 87.75 87.25 86.75 86.25 ...
* time (time) datetime64[ns] 2007-01-01 2007-01-02 2007-01-03 ...
Data variables:
dis (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
I import them and concatenate with xarray's concat function, I think successfully. In this case the 3 netCDF filenames are read out of filestrF:
flist1 = [1,2,3]
ds_new = xr.concat([xr.open_dataset(filestrF[0,1,1,f]) for f in flist1],dim='time')
The details of the new dataset now show:
Dimensions: (lat: 360, lon: 720, time: 1095)
Seems fine to me. However, when I write this dataset back to netCDF, the file size has exploded: 1 year of data now seems to take roughly 700 MB.
ds_new.to_netcdf('saved_on_disk1.nc')
I would have expected 3 x 90 MB = 270 MB, since we are scaling (3x) in only one dimension (time); the variable dis and the other dimensions, lat and lon, remain constant in size.
Any ideas please for the huge upscale in size? I have tested reading in and writing back out files without concatenation, and do this successfully with no increase in size.
Upvotes: 6
Views: 2470
Reputation: 3453
Presuming that time is the record dimension, try using NCO's ncrcat to concatenate the three files quickly; it should preserve compression:
ncrcat file1.nc file2.nc file3.nc -O concat.nc
Upvotes: 2
Reputation: 9603
The netCDF files you started with are compressed, probably using netCDF4's chunk-wise compression feature.
When you read a single dataset and write it back to disk, xarray writes the data back with the same compression settings. But when you combine multiple files, the compression settings are reset. Part of the reason for this is that different files may be compressed on disk in different ways, so it isn't obvious how the combined result should be handled.
To save the new netCDF file with compression, use the encoding argument, as described in the xarray docs:
ds_new.to_netcdf('saved_on_disk1.nc', encoding={'dis': {'zlib': True}})
You will probably also want to manually specify the chunksizes argument based on your expected access patterns for the data.
If you're curious how these files were compressed originally, you can pull that information from the encoding attribute, e.g., xr.open_dataset(filestrF[0,1,1,1]).dis.encoding.
Upvotes: 7