Reputation: 39
I have many large .csv files that I want to convert to .nc (i.e. netCDF) files using xarray. However, I found that saving the .nc files takes a very long time, and the resulting .nc files are much larger (4x to 12x) than the original .csv files.
Below is sample code showing how the same data produces a .nc file that is about 4 times larger than the corresponding .csv:
import pandas as pd
import xarray as xr
import numpy as np
import os
# Create pandas DataFrame
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100000, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])
# Make 'e' a column of strings
df['e'] = df['e'].astype(str)
# Save to csv
df.to_csv('df.csv')
# Convert to an xarray's Dataset
ds = xr.Dataset.from_dataframe(df)
# Save NetCDF file
ds.to_netcdf('ds.nc')
# Compute stats
stats1 = os.stat('df.csv')
stats2 = os.stat('ds.nc')
print('csv=',str(stats1.st_size))
print('nc =',str(stats2.st_size))
print('nc/csv=',str(stats2.st_size/stats1.st_size))
The result:
>>> csv = 1688902 bytes
>>> nc = 6432441 bytes
>>> nc/csv = 3.8086526038811015
As you can see, the .nc file is about 4 times larger than the .csv file.
I found this post suggesting that changing from type 'string' to type 'char' drastically reduces file size, but how do I do this in xarray?
Also, note that even when all the data are integers (i.e. when df['e'] = df['e'].astype(str) is commented out), the resulting .nc file is still 50% larger than the .csv.
Am I missing a compression setting? ...or something else?
Upvotes: 0
Views: 1249
Reputation: 1905
Since you use only values from 0 to 9, one byte per value is sufficient to store the data in the CSV file. By default, xarray uses int64 (8 bytes) for integers.
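You can confirm this by inspecting the dtypes of the dataset built in the question (a minimal check; the exact integer width may vary by platform):
# Show the dtype xarray assigned to each variable; the numeric columns
# typically come out as int64 and the string column 'e' as object.
print({name: str(var.dtype) for name, var in ds.data_vars.items()})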
To tell xarray to use 1-byte integers, you can use this:
ds.to_netcdf('ds2.nc', encoding={'a': {'dtype': 'int8'},
                                 'b': {'dtype': 'int8'},
                                 'c': {'dtype': 'int8'},
                                 'd': {'dtype': 'int8'},
                                 'e': {'dtype': 'S1'}})
The resulting file is 1307618 bytes. Compression will reduce the file size even more, especially for non-random data :-)
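For example, the dtype hints can be combined with zlib compression in the same encoding dict (a sketch; the compression level of 4 and the output filename ds3.nc are just assumptions):
# Combine the 1-byte dtypes with zlib compression in one encoding dict.
encoding = {name: {'dtype': 'int8', 'zlib': True, 'complevel': 4}
            for name in ['a', 'b', 'c', 'd']}
encoding['e'] = {'dtype': 'S1', 'zlib': True, 'complevel': 4}
ds.to_netcdf('ds3.nc', encoding=encoding)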
Upvotes: 1
Reputation: 39
I found an answer to my own question...
- Enable compression for each variable by setting zlib to True
- For the column of strings e, specify that dtype is "character" (i.e. S1)
Before saving the .nc file, add the following code:
encoding = {'a': {'zlib': True},
            'b': {'zlib': True},
            'c': {'zlib': True},
            'd': {'zlib': True},
            'e': {'zlib': True, 'dtype': 'S1'}}
ds.to_netcdf('ds.nc', format='NETCDF4', engine='netcdf4', encoding=encoding)
The new results are:
>>> csv = 1688902 bytes
>>> nc = 1066182 bytes
>>> nc/csv = 0.6312870729029867
Note that it still takes a bit of time to save the .nc file.
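If there are many columns, the encoding dict can also be built programmatically rather than listed by hand (a sketch, assuming only object-dtype columns need the S1 character dtype):
# Build a zlib encoding entry for every data variable; give object/string
# variables such as 'e' a 1-byte character dtype.
encoding = {}
for name, var in ds.data_vars.items():
    enc = {'zlib': True}
    if var.dtype == object:
        enc['dtype'] = 'S1'
    encoding[name] = enc
ds.to_netcdf('ds.nc', format='NETCDF4', engine='netcdf4', encoding=encoding)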
Upvotes: 3