Reputation: 1389
I'm reading a netCDF file of 3-hourly temperature (t2m) data with xarray in Python. The data has dimensions (time: 2920, latitude: 189, longitude: 521), i.e. shape (2920, 189, 521), which represents one year of data. I have 30 of these files, 2 GB each.
longitude (longitude) float32 -170.0 -169.8 ... -40.25 -40.0
latitude (latitude) float32 82.0 81.75 81.5 ... 35.5 35.25 35.0
time (time) datetime64[ns] 1979-01-01T01:00:00 ... 1979-12-...
I would like to reshape this data into a format which I can feed into scikit-learn's
sklearn.model_selection.train_test_split
i.e. I would like to generate the following DataFrame for each file/year:
index time lat lon t2m
0 1979-01-01T00:00:00 35 -170 270
1 1979-01-01T00:00:00 35 -169.75 269
2 1979-01-01T00:00:00 35 -169.5 271
...
n-1 1979-12-31T21:00:00 82 -40.25 241
n 1979-12-31T21:00:00 82 -40 244
Note that there would be 521 rows with lat=35 (one per longitude) before moving on to the next latitude value. After getting through all 189 latitude values, we go to the next time step and repeat until finished.
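To make the intended ordering concrete, here is a minimal sketch of the row arithmetic, assuming longitude varies fastest, then latitude, then time. The dimension sizes come from the shape above; `row_number` is a hypothetical helper used only for illustration.

```python
# Dimension sizes taken from the shape quoted above: (time, latitude, longitude).
N_TIME, N_LAT, N_LON = 2920, 189, 521


def row_number(i_time, i_lat, i_lon):
    """Row position in the long table for zero-based positions along each dimension.

    Longitude varies fastest, then latitude, then time.
    """
    return (i_time * N_LAT + i_lat) * N_LON + i_lon


rows_per_timestep = N_LAT * N_LON
rows_per_file = N_TIME * rows_per_timestep

print(rows_per_timestep)    # 98469
print(rows_per_file)        # 287529480
print(row_number(0, 1, 0))  # 521: first row of the second latitude
```

So each yearly DataFrame would have roughly 287.5 million rows.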
I assume there is a way to achieve what I want with some combination of melting and reshaping of the xarray ds but I've yet to find anything that works. Any advice would be appreciated.
Upvotes: 0
Views: 167
Reputation: 3417
This should be achievable with xarray's built-in methods, as shown below. There are possibly more commands here than you need. One thing to be careful about when converting xarray datasets to dataframes: if the coordinates have "bounds" variables, the conversion can duplicate values, but the code below should deal with that.
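df = (ds
    # convert to dataframe
    .to_dataframe()
    # convert time and latitude/longitude from index levels to columns
    .reset_index()
    # the coordinates are named latitude/longitude in the file; shorten them
    .rename(columns={"latitude": "lat", "longitude": "lon"})
    # only select what you want, in case there are bnds etc. in the data
    .loc[:, ["time", "lat", "lon", "t2m"]]
    # remove duplicates that could be introduced by bnds
    .drop_duplicates()
    # add an index
    .reset_index()
)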
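To check the result without loading a 2 GB file, here is a minimal sketch on a tiny synthetic dataset. The values and grid are made up for illustration; only the dimension and coordinate names (time, latitude, longitude, t2m) and the north-to-south latitude order are taken from the question. It also adds a sort, on the assumption that you want latitude ascending (35 first) as in the requested table.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for one file: 2 time steps, 3 latitudes, 4 longitudes.
time = pd.date_range("1979-01-01", periods=2, freq=pd.Timedelta(hours=3))
lat = np.array([82.0, 58.5, 35.0], dtype="float32")  # north to south, as in the file
lon = np.array([-170.0, -126.5, -83.5, -40.0], dtype="float32")

# Made-up values 0..23 so each cell is easy to identify afterwards.
values = np.arange(time.size * lat.size * lon.size, dtype="float32")
values = values.reshape(time.size, lat.size, lon.size)

ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), values)},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

df = (
    ds.to_dataframe()
    .reset_index()
    .rename(columns={"latitude": "lat", "longitude": "lon"})
    .loc[:, ["time", "lat", "lon", "t2m"]]
    # order rows by time, then latitude ascending, then longitude ascending
    .sort_values(["time", "lat", "lon"])
    .reset_index(drop=True)
)

print(df.head())
```

The frame has one row per (time, lat, lon) combination, so 24 rows here. Whether the sort is needed depends on whether the row order matters for your model; by default train_test_split shuffles the rows anyway.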
Upvotes: 1