Xarray reads data in file as coordinates, apparently indexing coordinates, how do I convert the actual data from coordinates to data variables?

Question

I am working with the file at:

https://satdat.ngdc.noaa.gov/sem/poes/data/processed/ngdc/uncorrected/full/2013/metop01/poes_m01_20130525_proc.nc

when I read it in using xarray,

ds = xr.open_dataset('poes_m01_20130525_proc.nc')

all of the variables are read in as coordinates, with at least some of them as indexing coordinates. I only know the last bit, because when I try to convert them to variables using,

ds.reset_coords()

I get the error,

ValueError: cannot remove index coordinates with reset_coords

The error appears to list all of the variables (there is a very long list).

I can convert all of the coordinate variables into a numpy array and rebuild a new Dataset manually. However, I am very new to xarray. Is there a more elegant way to do this? For instance, can I convert the indexing coordinates to non-indexing coordinates and then use reset_coords? Also, how do I tell which coordinates are indexing coordinates and which are not?

Or, better, is there some option that I should be using when reading the file that I don't know about? I don't recognize anything in the documentation that would suggest this, but there is a lot in the documentation that I don't understand.

Thanks for any help!

OriolAbril · Accepted Answer

As you have guessed, a coordinate can only be converted to a data variable if it is a non-indexing coordinate. You can recognize indexing coordinates by the * shown right before their names when the coordinates are listed. In your example, it looks like every single variable is assumed to be its own coordinate (no idea why, I'm not a NetCDF expert).
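Besides reading the * in the printed representation, one way to check this programmatically is to look at the dataset's indexes mapping. The following is a minimal sketch on a made-up dataset; the names ds_demo, flux, and lat are hypothetical and not taken from the file in the question:

```python
import numpy as np
import xarray as xr

# Toy dataset (hypothetical names, not the POES file):
# "time" is a dimension coordinate, so it gets an index;
# "lat" is a coordinate along "time" without its own index;
# "flux" is an ordinary data variable.
ds_demo = xr.Dataset(
    data_vars={"flux": ("time", np.array([1.0, 2.0, 3.0]))},
    coords={
        "time": np.array([0, 1, 2]),
        "lat": ("time", np.array([10.0, 11.0, 12.0])),
    },
)

# Coordinates that appear in ds_demo.indexes are the indexing ones.
index_coords = [name for name in ds_demo.coords if name in ds_demo.indexes]
non_index_coords = [name for name in ds_demo.coords if name not in ds_demo.indexes]

print("indexing:", index_coords)
print("non-indexing:", non_index_coords)
```

Applied to the dataset from the question, the same two list comprehensions on ds should show which names would be rejected by reset_coords.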

To convert an indexing coordinate into a non-indexing coordinate, you can use reset_index, which requires specifying which indexes are to be reset. I took the liberty of assuming that the first handful of coordinates in your dataset are correctly set as coordinates and that the rest should be data variables. In this case, the following code could solve the problem:

var_names = list(
    set(ds.dims) - 
    {"time", "year", "day", "msec", "satID", "sat_direction", "alt", "lat", "lon"}
)
clean_ds = ds.reset_index(var_names).reset_coords()

This leaves us with a not very useful dataset, though. reset_index has added a _ at the end of each variable name (to distinguish the non-indexing coordinate from the dimension with the same name). You'll probably want to do something similar to what is done in this other answer: "Xarray: Make two DataArrays in the same Dataset use the same coordinate system".

Some ideas:

Get all variables to have time as their dimension:

coord_names = ["time", "year", "day", "msec", "satID", "sat_direction", "alt", "lat", "lon"]
clean_ds = clean_ds.reset_index(coord_names)
clean_ds = clean_ds.rename({name: "time_" for name in clean_ds.dims})

Then, rename variables and coords (and time_ dim) to remove trailing underscore in name:

# rename returns a new Dataset, so the result must be assigned back
clean_ds = clean_ds.rename({f"{name}_": name for name in var_names + coord_names})

If we had used rename_vars instead, the dimension time_ would not have been renamed; it could then be renamed afterwards to keep the time coordinate and the dimension distinct.

After all the renaming and restructuring, the attributes from the original Dataset can be added back to clean_ds:

for var_name in ds.coords:
    clean_ds[var_name] = clean_ds[var_name].assign_attrs(ds[var_name].attrs)
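For comparison, the manual rebuild mentioned in the question can be written fairly compactly. This is only a sketch under stated assumptions: it uses a made-up stand-in dataset (raw, with hypothetical variables time, lat, and flux) instead of the real file, and it assumes every variable has the same length so that all of them can share a single time dimension:

```python
import numpy as np
import xarray as xr

# Stand-in for the problematic file (hypothetical names and values):
# every variable is a coordinate along a dimension named after itself.
raw = xr.Dataset(
    coords={
        "time": ("time", np.array([0, 1, 2])),
        "lat": ("lat", np.array([10.0, 11.0, 12.0])),
        "flux": ("flux", np.array([1.0, 2.0, 3.0]), {"units": "counts"}),
    }
)

# Assumption: these names are the real coordinates; the rest is data.
coord_names = ["time", "lat"]
var_names = [name for name in raw.coords if name not in coord_names]

# Rebuild with one shared "time" dimension, carrying the attributes along.
# This only works if all variables have the same length as "time".
rebuilt = xr.Dataset(
    data_vars={n: ("time", raw[n].values, raw[n].attrs) for n in var_names},
    coords={n: ("time", raw[n].values, raw[n].attrs) for n in coord_names},
)
rebuilt.attrs.update(raw.attrs)  # keep global attributes as well

print(rebuilt)
```

Because the names are set explicitly, this route avoids the trailing-underscore renaming step, at the cost of loading every variable into memory as a NumPy array.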
