Greg Madman

Reputation: 59

xarray open_mfdataset does not return arrays with Numpy data

I am currently trying to open multiple netCDF files. They all have the same main dimension (which is just the number of rows) and several variables: time, platform_code, and others.

Here is the code I use to try to concatenate all the data:

ds_disk_merged = xarray.open_mfdataset([path1, path2, path3, path4], concat_dim="row", combine='nested')

When I access the row coordinate, everything is fine: I get my numpy array, concatenated as expected:

In [5]: ds_disk_merged.row.data
Out[5]: array([     0,      1,      2, ..., 968041, 968042, 968043])

But when I access one of my variables, the data is not directly accessible:

In [6]: ds_disk_merged.time.data
Out[6]: dask.array<concatenate, shape=(968044,), dtype=datetime64[ns], chunksize=(253158,), chunktype=numpy.ndarray>

Do you know how to get all the variables' data concatenated as numpy arrays, the same way the rows are?

For reference, the number of rows in each file (path by path) is:

In [7]: all_nc_files_number_of_rows 
Out[7]: [249499, 232995, 232392, 253158]

Upvotes: 1

Views: 479

Answers (1)

Michael Delgado

Reputation: 15432

What you have is a dask.array, which is a chunked, scheduled (but not in-memory) set of multiple numpy arrays. In addition to providing a labeled indexing interface to arrays, xarray has the ability to work with multiple backends, which form the computational engine underlying the array operations on the .data attribute. When you use xr.open_mfdataset, the result will always be backed by chunked dask arrays. See the xarray docs on Parallel Computing with Dask for more info.
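To make the chunked/lazy behavior concrete, here is a minimal sketch using dask.array directly (the sizes are borrowed from the question; no netCDF files are needed):

```python
import dask.array as da
import numpy as np

# A dask array is a lazy collection of numpy chunks: nothing is
# computed until you explicitly ask for it.
lazy = da.arange(968044, chunks=253158)  # sizes from the question
print(type(lazy))        # a dask.array.core.Array, not a numpy array

chunked = lazy + 1       # still lazy: this only extends the task graph
result = chunked.compute()  # now the chunks are materialized as numpy
print(type(result))      # numpy.ndarray
```

The same pattern applies to the variables in your merged dataset: `.data` shows you the task graph, and `.compute()` actually runs it.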

You can just convert to numpy with ds_disk_merged = ds_disk_merged.compute(). Note that the work of reading in the netCDF data will not occur until you trigger a computation like this - until then dask will only schedule the operation. Because of this, problems such as read errors, memory bottlenecks, or other workflow issues may only surface when the job is executed, not at the line of code that actually causes them. See the dask docs on lazy execution for an intro to this concept.
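A small self-contained sketch of that conversion, using an in-memory dataset chunked to mimic what open_mfdataset returns (the variable and dimension names here are illustrative):

```python
import numpy as np
import xarray as xr

# Chunking makes the variable dask-backed, like open_mfdataset would.
ds = xr.Dataset({"time": ("row", np.arange(10))}).chunk({"row": 4})
print(type(ds.time.data))   # a lazy dask array

ds = ds.compute()           # evaluate everything into memory
print(type(ds.time.data))   # numpy.ndarray
```

After `.compute()`, every variable's `.data` attribute is a plain numpy array, as with the row coordinate in the question.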

For starters, check the size of the array with ds_disk_merged[variable_name].data.nbytes and make sure you can fit it comfortably in memory before calling compute().
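That size check can be done without loading anything, because dask arrays report their would-be size up front. A sketch at the scale of the question's merged dataset (968044 rows; the variable name is illustrative):

```python
import numpy as np
import xarray as xr

# A dataset the size of the merged one, chunked so "time" is dask-backed.
ds = xr.Dataset({"time": ("row", np.arange(968_044))}).chunk({"row": 253_158})

# nbytes reports the in-memory size without triggering the computation.
nbytes = ds["time"].data.nbytes
print(f"{nbytes / 1e6:.1f} MB")  # 968044 rows * 8 bytes ~ 7.7 MB
```

If the reported size is comfortably below your available RAM, calling `.compute()` on the whole dataset is safe; otherwise, compute one variable at a time or keep working lazily.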

Upvotes: 3
