Reputation: 723
So I have 3 netCDF4 files (each approx. 90 MB) that I would like to concatenate using the xarray package. Each file has one variable (dis) represented at 0.5 degree resolution (lat, lon) for 365 days (time). My aim is to concatenate the three files so that we have a timeseries of 1095 days (3 years).
Each file (for years 2007, 2008, 2009) has:

1 variable: dis
3 coordinates: time, lat, lon

... as such:
<xarray.Dataset>
Dimensions: (lat: 360, lon: 720, time: 365)
Coordinates:
* lon (lon) float32 -179.75 -179.25 -178.75 -178.25 -177.75 -177.25 ...
* lat (lat) float32 89.75 89.25 88.75 88.25 87.75 87.25 86.75 86.25 ...
* time (time) datetime64[ns] 2007-01-01 2007-01-02 2007-01-03 ...
Data variables:
dis (time, lat, lon) float64 nan nan nan nan nan nan nan nan nan ...
I import them and concatenate with xarray's concat function, I think successfully. In this case the 3 netCDF filenames are read out of filestrF:
flist1 = [1,2,3]
ds_new = xr.concat([xr.open_dataset(filestrF[0,1,1,f]) for f in flist1],dim='time')
The details of the new dataset now show:
Dimensions: (lat: 360, lon: 720, time: 1095)
Seems fine to me. However, when I write this dataset back to netCDF, the file size has exploded: 1 year of data now seems to take roughly 700 MB.
ds_new.to_netcdf('saved_on_disk1.nc')
I would have expected 3 x 90 MB = 270 MB, since we are scaling (3x) in only one dimension (time); the variable dis and the other dimensions, lat and lon, remain constant in size.
Any ideas please for the huge upscale in size? I have tested reading in and writing back out files without concatenation, and do this successfully with no increase in size.
Upvotes: 6
Views: 2470
Reputation: 3453
Presuming that time is the record dimension, try using NCO's ncrcat to concatenate the three files quickly; it should preserve compression:
ncrcat file1.nc file2.nc file3.nc -O concat.nc
Upvotes: 2
Reputation: 9603
The netCDF files you started with are compressed, probably using netCDF4's chunk-wise compression feature.
When you read a single dataset and write it back to disk, xarray writes the data back with the same compression settings. But when you combine multiple files, the compression settings are reset. Part of the reason for this is that different files may be compressed on disk in different ways, so it isn't obvious how the combined result should be handled.
To save the new netCDF file with compression, use the encoding argument, as described in the xarray docs:
ds_new.to_netcdf('saved_on_disk1.nc', encoding={'dis': {'zlib': True}})
You will probably also want to manually specify the chunksizes argument based on your expected access patterns for the data.
If you're curious how these files were compressed originally, you can pull that information from the encoding attribute, e.g., xr.open_dataset(filestrF[0,1,1,1]).dis.encoding.
Upvotes: 7