Diego

Reputation: 39

csv to netCDF produces .nc files 4X larger than the original .csv

I have many large .csv files that I want to convert to .nc (i.e. netCDF) files using xarray. However, I found that saving the .nc files takes a very long time, and the resulting .nc files are much larger (4x to 12x larger) than the original .csv files.

Below is sample code showing how the same data produces a .nc file about 4 times larger than the .csv version:

import pandas as pd
import xarray as xr
import numpy as np
import os

# Create pandas DataFrame 
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100000,5)),
                   columns=['a', 'b', 'c', 'd', 'e'])

# Make 'e' a column of strings
df['e'] = df['e'].astype(str)

# Save to csv
df.to_csv('df.csv')

# Convert to an xarray's Dataset
ds = xr.Dataset.from_dataframe(df)

# Save NetCDF file
ds.to_netcdf('ds.nc')

# Compute stats
stats1 = os.stat('df.csv')
stats2 = os.stat('ds.nc')
print('csv=',str(stats1.st_size))
print('nc =',str(stats2.st_size))
print('nc/csv=',str(stats2.st_size/stats1.st_size))

The result:

>>> csv = 1688902 bytes
>>>  nc = 6432441 bytes
>>> nc/csv = 3.8086526038811015

As you can see, the .nc file is about 4 times larger than the .csv file.

I found this post suggesting that changing from type 'string' to type 'char' drastically reduces file size, but how do I do this in xarray?

Also, note that even with all data as integers (i.e. commenting out df['e'] = df['e'].astype(str)), the resulting .nc file is still 50% larger than the .csv.

Am I missing a compression setting? ...or something else?

Upvotes: 0

Views: 1249

Answers (2)

Alex338207

Reputation: 1905

Since your values only range from 0 to 9, one byte per value is sufficient to store them in the CSV file. By default, xarray uses int64 (8 bytes) for integers.

To tell xarray to use 1-byte integers, you can use this:

ds.to_netcdf('ds2.nc', encoding={'a': {'dtype': 'int8'},
                                 'b': {'dtype': 'int8'},
                                 'c': {'dtype': 'int8'},
                                 'd': {'dtype': 'int8'},
                                 'e': {'dtype': 'S1'}})

The resulting file is 1307618 bytes. Compression will reduce the file size even more, especially for non-random data :-)

Upvotes: 1

Diego

Reputation: 39

I found an answer to my own question...

  1. Enable compression for each variable
  2. For column 'e', specify a character dtype (i.e. 'S1')

Before saving the .nc file, add the following code:

encoding = {'a':{'zlib':True},
            'b':{'zlib':True},
            'c':{'zlib':True},
            'd':{'zlib':True},
            'e':{'zlib':True, 'dtype':'S1'}}
ds.to_netcdf('ds.nc',format='NETCDF4',engine='netcdf4',encoding=encoding)

The new results are:

>>> csv = 1688902 bytes
>>>  nc = 1066182 bytes
>>> nc/csv = 0.6312870729029867

Note that it still takes a bit of time to save the .nc file.

Upvotes: 3
