Reputation: 123
I have a DataArray with date, x, and y coordinate dims.
date: 265, y: 1458, x: 1159
For each year, over the course of 35 years, there are about 10 arrays with a date and x and y dims.
I'd like to groupby a custom annual season to integrate values for each raster over that custom annual season (maybe October to April for each year). The xarray docs show that I can do something like this:
arr.groupby("date.season")
which results in
DataArrayGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
But this groups by season over the entire 35 year record and I can't control the start and ending months that are used to group.
Similarly this does not quite get at what I want:
all_eeflux_arr.groupby("date.year")
DataArrayGroupBy, grouped over 'year'
36 groups with labels 1985, 1986, 1987, ..., 2019, 2020.
The start and the ending date for each year is automatically January/December.
If this grouping can also discard dates that fall outside the grouping window, even better, since I may want to group by a day of year range that doesn't span all months.
The DataArray.resample method seems to be able to select a custom offset (see this SO post) for the start of the year, but I can't figure out how to access this with groupby. I can't use resample because it does not return a DataArray and I need to call the .integrate xarray method on each group DataArray (across the date dim, to get custom annual totals).
Data to reproduce (2Gb, but can be tested on a subset): https://ucsb.box.com/s/zhjxkqje18m61rivv1reig2ixiigohk0
Code to reproduce
import rioxarray as rio
import xarray as xr
import numpy as np
from pathlib import Path
from datetime import datetime
all_scenes_f_et = Path('/home/serdp/rhone/rhone-ecostress/rasters/eeflux/PDR')
all_pdr_et_paths = list(all_scenes_f_et.glob("*.tif"))
def eeflux_path_date(path):
year, month, day, _ = path.name.split("_")
return datetime(int(year), int(month), int(day))
def open_eeflux(path, da_for_match):
data_array = rio.open_rasterio(path) #chunks makes i lazyily executed
data_array.rio.reproject_match(da_for_match)
data_array = data_array.sel(band=1).drop("band") # gets rid of old coordinate dimension since we need bands to have unique coord ids
data_array["date"] = eeflux_path_date(path) # makes a new coordinate
return data_array.expand_dims({"date":1}) # makes this coordinate a dimension
da_for_match = rio.open_rasterio(all_pdr_et_paths[0])
daily_eeflux_arrs = [open_eeflux(path, da_for_match) for path in all_pdr_et_paths]
all_eeflux_arr = xr.concat(daily_eeflux_arrs, dim="date")
all_eeflux_arr = all_eeflux_arr.sortby("date")
### not sure what should go here
all_eeflux_arr.groupby(????????).integrate(dim="date", datetime_unit="D")
Advice is much appreciated!
Upvotes: 2
Views: 1358
Reputation: 123
I ended up writing a function that works well enough. Since my dataset isn't that large the integration doesn't take very long to run in a for loop that iterates over each group.
def group_by_custom_doy(all_eeflux_arr, doy_start, doy_end):
ey = max(all_eeflux_arr['date.year'].values)
sy = min(all_eeflux_arr['date.year'].values)
start_years = range(sy,ey)
end_years = range(sy+1, ey+1)
start_end_years = list(zip(start_year,end_year))
water_year_arrs = []
for water_year in start_end_years:
start_mask = ((all_eeflux_arr['date.dayofyear'].values > doy_start) & (all_eeflux_arr['date.year'].values == water_year[0]))
end_mask = ((all_eeflux_arr['date.dayofyear'].values < doy_end) & (all_eeflux_arr['date.year'].values == water_year[1]))
water_year_arrs.append(all_eeflux_arr[start_mask | end_mask])
return water_year_arrs
water_year_arrs = group_by_custom_doy(all_eeflux_arr, 125, 300)
Upvotes: 1