Glynbeard

Reputation: 1389

Reshaping and combining data from netCDF in Python

I'm currently reading a netCDF file of 3-hourly temperature (t2m) data into Python with xarray. The data has dimensions (time: 2920, latitude: 189, longitude: 521), i.e. shape (2920, 189, 521), which represents one year of data. I have 30 of these files, each about 2 GB.

longitude (longitude) float32         -170.0 -169.8 ... -40.25 -40.0
latitude  (latitude)  float32         82.0 81.75 81.5 ... 35.5 35.25 35.0
time      (time)      datetime64[ns]  1979-01-01T01:00:00 ... 1979-12-...

I would like to reshape this data into a format which I can feed into scikit-learn's

sklearn.model_selection.train_test_split

i.e. I would like to generate the following DataFrame for each file/year:

index   time                  lat   lon       t2m
0       1979-01-01T00:00:00   35    -170      270
1       1979-01-01T00:00:00   35    -169.75   269
2       1979-01-01T00:00:00   35    -169.5    271
...
n-1     1979-12-31T21:00:00   82    -40.25    241
n       1979-12-31T21:00:00   82    -40       244

Note that there would be 521 rows with lat=35 (one per longitude) before moving on to the next latitude value. After all 189 latitude values have been covered, we go to the next timestep and repeat until finished.

I assume there is a way to achieve what I want with some combination of melting and reshaping of the xarray ds but I've yet to find anything that works. Any advice would be appreciated.

Upvotes: 0

Views: 167

Answers (1)

Robert Wilson

Reputation: 3417

This should be achievable with xarray's built-in methods, as shown below. There are possibly more steps here than you need. One thing to be careful about when converting xarray datasets to dataframes: if the coordinates have "bounds" variables, the conversion can duplicate values, but the code below should deal with that.

df = (ds
      # convert to dataframe
      .to_dataframe()
      # convert time and latitude/longitude to columns
      .reset_index()
      # the coordinates in the question are named "latitude"/"longitude"
      .rename(columns={"latitude": "lat", "longitude": "lon"})
      # only select what you want, in case there are bnds etc. in the data
      .loc[:, ["time", "lat", "lon", "t2m"]]
      # remove duplicates that could be introduced by bnds
      .drop_duplicates()
      # add an index
      .reset_index()
      )
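
To check the row order on something small before running this on a 2 GB file, here is a minimal sketch on a synthetic dataset. The tiny grid and the t2m values are made up for illustration; only the dimension and coordinate names are taken from the question. The sort_values step is an addition, assuming you want the order described in the question (time first, then latitude ascending, then longitude), since the latitude coordinate in the file runs from north to south.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for one file: 2 time steps, 3 latitudes, 4 longitudes.
# Grid values are illustrative, not taken from the real data.
times = pd.to_datetime(["1979-01-01T00:00", "1979-01-01T03:00"])
lats = np.array([36.0, 35.5, 35.0], dtype="float32")  # descending, as in the question
lons = np.array([-170.0, -169.75, -169.5, -169.25], dtype="float32")
values = np.arange(2 * 3 * 4, dtype="float32").reshape(2, 3, 4)

ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), values)},
    coords={"time": times, "latitude": lats, "longitude": lons},
)

df = (
    ds["t2m"]
    # one row per (time, latitude, longitude) combination
    .to_dataframe()
    # move the MultiIndex levels into ordinary columns
    .reset_index()
    .rename(columns={"latitude": "lat", "longitude": "lon"})
    # time outermost, then latitude ascending, then longitude ascending
    .sort_values(["time", "lat", "lon"])
    .reset_index(drop=True)
)

print(df.head())
```

With the real dimensions, each yearly file becomes 2920 × 189 × 521 = 287,529,480 rows, so it may be worth converting a subset first (for example with ds.isel or ds.sel) to see whether the full table fits in memory.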

Upvotes: 1
