Hervé86

Reputation: 21

Most efficient way to use open_mfdataset to open many netCDF files with xarray/dask (abnormally slow in my case)

I have a directory of 363 netCDF files, one per time step (all files share the same internal structure, with a "time" dimension of length 1), 270 MB each, for a total of about 100 GB. I want to load all of this data into a single xarray Dataset (backed by dask arrays and chunks). open_mfdataset seems to be the appropriate tool, but I must not be using it properly because it is very slow.

# Import modules
import time
import numpy as np
import xarray as xr
from dask.distributed import Client

# Start a local dask cluster
client = Client()

# Define variables of interest (not yet used in the call below)
variables = ['nitrogendioxide_tropospheric_column_qafiltered']

# Read data
start = time.time()
data_dir = '/data_directory'
ds = xr.open_mfdataset('{}/*2019*.nc'.format(data_dir), engine='netcdf4', combine='nested', concat_dim='time', parallel=True)
ds.close()
print(' | size(ds)/duration = {:0.2f}GB / {:0.2f}s'.format(ds.nbytes / 1e9, time.time() - start))

The time required to do this is: size(ds)/duration = 98.83GB / 1746.73s. Why is it so slow?

Note that removing client = Client() and parallel=True does not change the time significantly, so I am a bit confused.

NB: This test was run in an interactive session on an HPC facility:

>>> client  
<Client: 'tcp://127.0.0.1:43651' processes=4 threads=4, memory=33.78 GB>

NB (bis): The resulting xarray Dataset is:

>>> ds
<xarray.Dataset>
Dimensions:                                                    (corner: 4, time: 363, x: 1028, x_b: 1029, y: 649, y_b: 650)
Coordinates:
    lat                                                        (y, x) float64 dask.array<chunksize=(649, 1028), meta=np.ndarray>
    lon                                                        (y, x) float64 dask.array<chunksize=(649, 1028), meta=np.ndarray>
  * time                                                       (time) datetime64[ns] 2019-01-01T05:00:00 ... 2019-12-31T05:00:00
Dimensions without coordinates: corner, x, x_b, y, y_b
Data variables:
    nitrogendioxide_tropospheric_column                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_tropospheric_column_precision              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    qa_value                                                   (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    latitude_bounds                                            (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    longitude_bounds                                           (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    solar_zenith_angle                                         (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    solar_azimuth_angle                                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_zenith_angle                                       (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_azimuth_angle                                      (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_fraction_crb                                         (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_altitude                                           (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_albedo                                             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_classification                                     (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_pressure                                           (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_radiance_fraction_nitrogendioxide_window             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column                       (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column_precision             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    degrees_of_freedom                                         (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    one                                                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    latitude_bounds_qafiltered                                 (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    longitude_bounds_qafiltered                                (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    nitrogendioxide_tropospheric_column_qafiltered             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_tropospheric_column_precision_qafiltered   (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    qa_value_qafiltered                                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_radiance_fraction_nitrogendioxide_window_qafiltered  (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_fraction_crb_qafiltered                              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_altitude_qafiltered                                (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_albedo_qafiltered                                  (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_classification_qafiltered                          (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_pressure_qafiltered                                (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    solar_zenith_angle_qafiltered                              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    solar_azimuth_angle_qafiltered                             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_zenith_angle_qafiltered                            (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_azimuth_angle_qafiltered                           (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column_qafiltered            (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column_precision_qafiltered  (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    degrees_of_freedom_qafiltered                              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    lat_b                                                      (time, y_b, x_b) float64 dask.array<chunksize=(1, 650, 1029), meta=np.ndarray>
    lon_b                                                      (time, y_b, x_b) float64 dask.array<chunksize=(1, 650, 1029), meta=np.ndarray>
Attributes:
    regrid_method:  conservative
    history:        read PRODUCT group...

I looked at other posts but could not find an answer to this issue. Thanks for your help.

Upvotes: 2

Views: 4095

Answers (1)

Charles

Reputation: 127

See this GitHub issue: many people appear to be having problems with the performance of open_mfdataset, and there are no obvious solutions at the moment.
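In the meantime, one thing that may be worth trying is the combination of options the xarray documentation suggests for datasets split across many files with many variables: data_vars='minimal', coords='minimal' and compat='override', together with a preprocess function that drops everything except the variable of interest before the files are combined. The sketch below is untested against this dataset, so treat it as a starting point and not a confirmed fix. The helper names select_variables and open_selected are made up for illustration; the variable name and the open_mfdataset call come from the question.

```python
from functools import partial

import xarray as xr

# Variable of interest, taken from the question.
KEEP = ["nitrogendioxide_tropospheric_column_qafiltered"]


def select_variables(ds, names):
    """Hypothetical helper: keep only the requested data variables of one file."""
    return ds[names]


def open_selected(pattern, names=KEEP):
    """Hypothetical helper: open all files matching `pattern`, keeping only `names`."""
    return xr.open_mfdataset(
        pattern,
        engine="netcdf4",
        combine="nested",
        concat_dim="time",
        # Applied to each file's dataset before the files are combined.
        preprocess=partial(select_variables, names=names),
        # Only concatenate variables that already have the "time" dimension.
        data_vars="minimal",
        coords="minimal",
        # Take non-concatenated variables (e.g. lat, lon) from the first file
        # without comparing them across all files.
        compat="override",
        parallel=True,
    )
```

It would be called as ds = open_selected('/data_directory/*2019*.nc'). Note that compat='override' assumes lat and lon really are identical in every file, which the question suggests but does not state outright.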

Upvotes: 2
