Reputation: 21
I have a directory of 363 netcdf files corresponding to different times, (all files have a similar internal structure, with a "time" dimension of 1), 270MB each, for a total of about 100GB. I want to upload all these data in a single xarray (with dask arrays and chunks). It seems open_mfdataset is the appropriate solution, but it seems I am not using it properly because it is very slow.
# Import modules
import time
import numpy as np
import xarray as xr
from dask.distributed import Client
client = Client()
# Define variables of interest
vars = ['nitrogendioxide_tropospheric_column_qafiltered']
# Read data
start = time.time()
dir = '/data_directory/'
ds = xr.open_mfdataset('{}/*2019*.nc'.format(dir), engine='netcdf4', combine='nested', concat_dim='time', parallel=True)
ds.close()
print(' | size(ds)/duration = {:0.2f}GB / {:0.2f}s'.format(ds.nbytes / 1e9,time.time()-start))
The time required to do this is : size(ds)/duration = 98.83GB / 1746.73s
. Why so slow?
Note that if I don't put the client = Client()
and parallel=True
, it does not change significantly the time, so I am a bit confused.
NB : This test is performed on an interactive session in a HPC facilities :
>>> client
<Client: 'tcp://127.0.0.1:43651' processes=4 threads=4, memory=33.78 GB>
NBbis : The xarray obtained is:
>>> ds
<xarray.Dataset>
Dimensions: (corner: 4, time: 363, x: 1028, x_b: 1029, y: 649, y_b: 650)
Coordinates:
lat (y, x) float64 dask.array<chunksize=(649, 1028), meta=np.ndarray>
lon (y, x) float64 dask.array<chunksize=(649, 1028), meta=np.ndarray>
* time (time) datetime64[ns] 2019-01-01T05:00:00 ... 2019-12-31T05:00:00
Dimensions without coordinates: corner, x, x_b, y, y_b
Data variables:
nitrogendioxide_tropospheric_column (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
nitrogendioxide_tropospheric_column_precision (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
qa_value (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
latitude_bounds (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
longitude_bounds (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
solar_zenith_angle (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
solar_azimuth_angle (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
viewing_zenith_angle (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
viewing_azimuth_angle (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
cloud_fraction_crb (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_altitude (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_albedo (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_classification (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_pressure (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
cloud_radiance_fraction_nitrogendioxide_window (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
nitrogendioxide_stratospheric_column (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
nitrogendioxide_stratospheric_column_precision (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
degrees_of_freedom (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
one (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
latitude_bounds_qafiltered (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
longitude_bounds_qafiltered (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
nitrogendioxide_tropospheric_column_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
nitrogendioxide_tropospheric_column_precision_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
qa_value_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
cloud_radiance_fraction_nitrogendioxide_window_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
cloud_fraction_crb_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_altitude_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_albedo_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_classification_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
surface_pressure_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
solar_zenith_angle_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
solar_azimuth_angle_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
viewing_zenith_angle_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
viewing_azimuth_angle_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
nitrogendioxide_stratospheric_column_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
nitrogendioxide_stratospheric_column_precision_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
degrees_of_freedom_qafiltered (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
lat_b (time, y_b, x_b) float64 dask.array<chunksize=(1, 650, 1029), meta=np.ndarray>
lon_b (time, y_b, x_b) float64 dask.array<chunksize=(1, 650, 1029), meta=np.ndarray>
Attributes:
regrid_method: conservative
history: read PRODUCT group...
I looked at the other posts but am not able to find an answer for this issue. Thanks for your help.
Upvotes: 2
Views: 4095
Reputation: 127
See this github issue - there appear to be many people having problems with the performance of open_mfdataset, and no obvious solutions at the moment.
Upvotes: 2