alextc
alextc

Reputation: 3515

How to use the DataArray where() function to assign value from another DataArray based on conditions

I am working with xarray to create a new Dataset based on the conditions of values from another Dataset.

The input Dataset object ds_season is by seasons and has three dimensions as below.

    <xarray.Dataset>
    Dimensions:               (latitude: 106, longitude: 193, season: 4)
    Coordinates:
      * latitude              (latitude) float32 -39.2 -39.149525 ... -33.9
      * longitude             (longitude) float32 140.8 140.84792 ... 150.0
      * season                (season) object 'DJF' 'JJA' 'MAM' 'SON'
    Data variables:
        FFDI 95TH PERCENTILE  (season, latitude, longitude) float64 dask.array<shape=(4, 106, 193), chunksize=(4, 106, 193)>

I need to create a new Dataset which has three dimensions latitude, longitude and time. The latitude and longitude should have the same coordinates as the input Dataset, and the time coordinates should be days over 10 years.

For example, the resulting Dataset is like:

<xarray.Dataset>
Dimensions:    (latitude: 106, longitude: 193, time: 3653)
Coordinates:
  * latitude   (latitude) float32 -39.2 -39.149525 ... -33.950478 -33.9
  * longitude  (longitude) float32 140.8 140.84792 140.89584 ... 149.95209 150.0
  * time       (time) datetime64[ns] 1972-01-01T00:00:00 1972-01-02T00:00:00 1972-01-03T00:00:00 ... 1981-12-30T00:00:00 1981-12-31T00:00:00
Data variables:
    FFDI 95TH PERCENTILE  (time, latitude, longitude) float64 dask.array<shape=(3653, 106, 193), chunksize=(3653, 106, 193)>

The variable for a day should be the same as the variable for the season that the day falls in. This means, 1972-01-01, 1972-02-02 and 1972-02-28 should have the same value as the season DJF has; and 1972-04-01, 1972-05-02 and 1972-05-31 should have the same value as the season MAM has.

I am thinking of the Dataset's where() function but have no clue where to start with. http://xarray.pydata.org/en/stable/generated/xarray.Dataset.where.html?highlight=where#xarray.Dataset.where

Upvotes: 2

Views: 2339

Answers (2)

Ryan
Ryan

Reputation: 806

I agree with Andrea that creating a dataset with 3653 unique days which only replicates 4 different seasonal values is in general inefficient. If you give more information about your broader goals for doing this, perhaps we can suggest an alternative solution.

Assuming you really do want to do this, the quickest way is probably to use xarray's groupby broadcasting arithmetic. In what follows, I will assume that ds is the name of the second dataset in your original post (the one with dimensions (latitude: 106, longitude: 193, time: 3653)). Then you can do it very quickly like

zeros = xr.zeros_like(ds)
filled_in = zeros.groupby('time.season') + ds_season

This suggestion is inspired by the way we usually compute the anomaly from a seasonal climatology:

# original dataset with dimensions 'time'
ds = xr.open_dataset(...)
# climatology has dimension 'season'
ds_climatology = ds.groubpy('time.season').mean(dim='time') 
# anomaly has dimension 'time'
ds_anomaly = ds.groubpy('time.season') - ds_climatology

Upvotes: 0

Andrea Massetti
Andrea Massetti

Reputation: 115

First, a note. Creating a new DataArray copying the same identical spatial data on each day for 3 months can occupy a lot of disc space without making much sense. I would rather query the season DataArray every time that you need the data for a specific day. However, if you really need to do this operation, and to answer your question, I think that the most straightforward way to do this is to:

  1. first, create a new container; a np.ndarray is a good and neat idea.
  2. Then, build the date index,
  3. query your original season DataArray,
  4. and finally, create a new DataArray with dimension time.

In the following example, I created a season DataArray for testing. If I understood exactly your problem, you should be able to use your original array without changing much in the second part (with the creation of foo).

Let's get into it. First the imports:

import xarray as xr
import numpy as np
import pandas as pd

Create an empty container of the needed size.

data_s = np.zeros((4, 10, 10))

Fill it with dummy values.

data_s[0] = 0.5
data_s[1] = 0.9
data_s[2] = 0.8
data_s[3] = 0.45

Create dummy coordinates.

x = y = np.arange(10)

Create the season index.

seasons = ["spring", "summer", "autumn", "winter"]

Finally, create the DataArray.

bar = xr.DataArray(data_s, coords=[seasons, x, y], dims=['season', 'x', 'y'])

bar is the DataArray that you want to extract seasonal values from. Now repeat the same for the single dates.

Create a container array for 2000 days, that we will fill with each season's data.

data = np.ones((2000, 10, 10))
x = y = np.arange(10)
dates = pd.date_range('2000-01-01', periods=2000)

Here I assume boreal seasons starting on the first of the month (borrowed from here. Of course, you can easily write a better function, for example using the day of the year to get the season.

season = np.array((dates.month %12 + 3)//3)

Create a dictionary to convert the number above to the season's string previously assigned in bar

seas_to_num = {1:"spring", 2:"summer", 3:"autumn", 4:"winter"}

We fill the array for each day with the values found on bar[season].

for date, seas in enumerate(season):
    data[date] = bar.sel(season=seas_to_num[seas])

Finally, we create the DataArray.

foo = xr.DataArray(data, coords=[dates, x, y], dims=['time', 'x', 'y'])

Now selecting the 5th of April, we get the value for spring.

In [1]: foo.sel(time=pd.to_datetime("5/4/2001"))
Out[1]: 
array([[0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
   [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]])
Coordinates:
time     datetime64[ns] 2001-05-03
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y        (y) int32 0 1 2 3 4 5 6 7 8 9

Upvotes: 0

Related Questions