bravery

Reputation: 123

How to groupby a custom time range in xarray?

I have a DataArray with date, x, and y coordinate dims.

date: 265, y: 1458, x: 1159

For each year, over the course of 35 years, there are about 10 arrays with a date and x and y dims.

I'd like to groupby a custom annual season to integrate values for each raster over that custom annual season (maybe October to April for each year). The xarray docs show that I can do something like this:

arr.groupby("date.season")

which results in

DataArrayGroupBy, grouped over 'season' 
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.

But this groups by season over the entire 35 year record and I can't control the start and ending months that are used to group.

Similarly this does not quite get at what I want:

all_eeflux_arr.groupby("date.year")
DataArrayGroupBy, grouped over 'year' 
36 groups with labels 1985, 1986, 1987, ..., 2019, 2020.

The start and the ending date for each year is automatically January/December.

I'd really like to be able to groupby an arbitrary period with a start time and an end time in units of day of year.

If this grouping can also discard dates that fall outside the grouping window, even better, since I may want to group by a day of year range that doesn't span all months.

The DataArray.resample method seems to be able to select a custom offset (see this SO post) for the start of the year, but I can't figure out how to access this with groupby. I can't use resample because it does not return a DataArray and I need to call the .integrate xarray method on each group DataArray (across the date dim, to get custom annual totals).

Data to reproduce (2 GB, but can be tested on a subset): https://ucsb.box.com/s/zhjxkqje18m61rivv1reig2ixiigohk0

Code to reproduce

import rioxarray as rio
import xarray as xr
import numpy as np
from pathlib import Path
from datetime import datetime

all_scenes_f_et = Path('/home/serdp/rhone/rhone-ecostress/rasters/eeflux/PDR')

all_pdr_et_paths = list(all_scenes_f_et.glob("*.tif"))

def eeflux_path_date(path):
    year, month, day, _ = path.name.split("_")
    return datetime(int(year), int(month), int(day))

def open_eeflux(path, da_for_match):
    data_array = rio.open_rasterio(path) # passing chunks= here would make this lazily executed
    data_array = data_array.rio.reproject_match(da_for_match) # reproject_match returns a new array, so keep the result
    data_array = data_array.sel(band=1).drop("band") # gets rid of old coordinate dimension since we need bands to have unique coord ids
    data_array["date"] = eeflux_path_date(path) # makes a new coordinate
    return data_array.expand_dims({"date":1}) # makes this coordinate a dimension

da_for_match = rio.open_rasterio(all_pdr_et_paths[0])
daily_eeflux_arrs = [open_eeflux(path, da_for_match) for path in all_pdr_et_paths]
all_eeflux_arr = xr.concat(daily_eeflux_arrs, dim="date")

all_eeflux_arr = all_eeflux_arr.sortby("date")

### not sure what should go here
all_eeflux_arr.groupby(????????).integrate(dim="date", datetime_unit="D")

Advice is much appreciated!

Upvotes: 2

Views: 1358

Answers (1)

bravery

Reputation: 123

I ended up writing a function that works well enough. It returns one DataArray per custom year, and since my dataset isn't that large, integrating each one in a for loop over the groups doesn't take very long.

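def group_by_custom_doy(all_eeflux_arr, doy_start, doy_end):
    # Returns a list with one DataArray per pair of consecutive years.
    ey = max(all_eeflux_arr['date.year'].values)
    sy = min(all_eeflux_arr['date.year'].values)
    start_years = range(sy,ey)
    end_years = range(sy+1, ey+1)
    # Pair each start year with the year that follows it: (sy, sy+1), ..., (ey-1, ey)
    start_end_years = list(zip(start_years,end_years))
    water_year_arrs = []
    for water_year in start_end_years:
        # Dates after doy_start in the first year of the pair (doy_start itself is excluded)
        start_mask = ((all_eeflux_arr['date.dayofyear'].values > doy_start) & (all_eeflux_arr['date.year'].values == water_year[0]))
        # Dates before doy_end in the second year of the pair (doy_end itself is excluded)
        end_mask = ((all_eeflux_arr['date.dayofyear'].values < doy_end) & (all_eeflux_arr['date.year'].values == water_year[1]))
        # Boolean indexing along the first (date) dimension keeps only dates inside the window
        water_year_arrs.append(all_eeflux_arr[start_mask | end_mask])
    return water_year_arrs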

water_year_arrs = group_by_custom_doy(all_eeflux_arr, 125, 300)
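
For comparison, here is a minimal sketch of a different approach that stays within groupby: label each date with the year its season starts in, drop dates outside the window, then group on that label and integrate each group. The function name integrate_custom_season, the season_year coordinate, and the synthetic demo array are all made up for illustration, and unlike the function above the window bounds here are inclusive. It has only been reasoned through on toy data, so treat it as a starting point rather than a tested solution.

```python
import numpy as np
import pandas as pd
import xarray as xr


def integrate_custom_season(da, doy_start, doy_end, dim="date"):
    """Integrate `da` over a day-of-year window, one total per season.

    Hypothetical helper: bounds are inclusive, and a window with
    doy_start > doy_end is treated as wrapping over New Year.
    """
    doy = da[dim].dt.dayofyear.values
    year = da[dim].dt.year.values
    if doy_start <= doy_end:
        # Window sits inside a single calendar year.
        keep = (doy >= doy_start) & (doy <= doy_end)
        label = year
    else:
        # Window wraps: dates before doy_end belong to the season
        # that started in the previous calendar year.
        keep = (doy >= doy_start) | (doy <= doy_end)
        label = np.where(doy >= doy_start, year, year - 1)
    # Discard dates outside the window, then attach the season label.
    windowed = da.isel({dim: np.flatnonzero(keep)})
    windowed = windowed.assign_coords(season_year=(dim, label[keep]))
    # Trapezoidal integration over time (in days) within each season.
    return windowed.groupby("season_year").map(
        lambda group: group.integrate(dim, datetime_unit="D")
    )


# Synthetic stand-in for all_eeflux_arr: a constant field sampled daily.
dates = pd.date_range("2000-01-01", "2002-12-31", freq="D")
demo = xr.DataArray(
    np.ones((dates.size, 2, 3)),
    coords={"date": dates},
    dims=("date", "y", "x"),
)
totals = integrate_custom_season(demo, doy_start=274, doy_end=120)
print(totals["season_year"].values)
```

The first and last seasons in the result can be partial (the record may start or end mid-window), so they may need to be dropped before comparing annual totals.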

Upvotes: 1
