EJSABOLK

Reputation: 79

Xarray: grouping by contiguous identical values

In Pandas, it is simple to slice a series (or array) such as [1,1,1,1,2,2,1,1,1,1] into groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:

 datagroups = df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())

...where I obtain the individual groups via df[key][variable] == some condition. Groups that satisfy the same condition but aren't contiguous become separate groups. If the condition were x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
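For reference, the pandas trick described above can be sketched as a runnable example (assuming the condition x < 2, as in the example):

```python
import pandas as pd

s = pd.Series([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])

# Select the values matching the condition, then split on gaps in the
# original index: an index jump of more than 1 starts a new group.
matched = s[s < 2]
group_ids = matched.index.to_series().diff().ne(1).cumsum()

groups = [g.tolist() for _, g in matched.groupby(group_ids)]
# groups == [[1, 1, 1, 1], [1, 1, 1, 1]]
```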

I am attempting to do the same thing with the xarray package, because I am working with multidimensional data, but the above syntax obviously doesn't work there.

What I have managed so far:

a) apply the condition to separate the values I want with NaNs:

 datagroups_notsplit = df[key].where(df[key][variable] == some condition)

So now I have groups as in the example above: [1,1,1,1,NaN,NaN,1,1,1,1] (if the condition was x < 2). The question is, how do I split on the NaNs so that this becomes [1,1,1,1], [1,1,1,1]?

b) Alternatively, group by some condition...

 datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])

But then, following the example above, I end up with groups [1,1,1,1,1,1,1], [2,2]. Is there a way to then further split these groups on non-contiguous index values?

Upvotes: 1

Views: 512

Answers (2)

EJSABOLK

Reputation: 79

My use case was a bit more complicated than the minimal example I posted, due to the use of timeseries indices and the desire to subselect certain conditions; however, I was able to adapt smci's answer in the following way:

(1) create indexnumber variable:

df = Dataset(
    data_vars={
        'some_data': (('date'), some_data),
        'more_data': (('date'), more_data),
        'indexnumber': (('date'), arange(0, len(date_arr))),
    },
    coords={'date': date_arr},
)

(2) get the indices for the groupby groups:

ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes

(3) get the cumsum field:

sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()

(4) reconstitute a new df:

df2 = df.loc[ind_slice]

(5) add the cumsum field:

df2['sumcum'] = sumcum

(6) groupby:

groups = df2.groupby(df2['sumcum'])

Hope this helps anyone else out there looking to do this.
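Condensed into a self-contained sketch (with hypothetical toy data in place of my timeseries, and numpy used to build the run labels, which sidesteps the off-by-one that diff introduces):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in data; the real dataset is a timeseries.
date_arr = pd.date_range('2020-01-01', periods=10)
some_data = np.array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])

df = xr.Dataset(
    data_vars={
        'some_data': ('date', some_data),
        'more_data': ('date', some_data),
        'indexnumber': ('date', np.arange(len(date_arr))),
    },
    coords={'date': date_arr},
)

# Steps (2)-(3): apply the condition, drop NaNs, label contiguous runs.
selected = df.where(df['more_data'] < 2)['indexnumber'].dropna(dim='date')
idx = selected.values.astype(int)
labels = np.concatenate([[0], np.cumsum(np.diff(idx) != 1)])

# Steps (4)-(6): subset to the selected dates, attach the labels, group.
df2 = df.sel(date=selected['date'])
df2['sumcum'] = ('date', labels)
groups = df2.groupby('sumcum')  # two groups: labels 0 and 1
```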

Upvotes: 0

smci

Reputation: 33940

Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just work around the missing pandas-style functionality, something like:

import pandas as pd
import xarray as xr

dat = xr.DataArray([1,1,1,1,2,2,1,1,1,1], dims='x')

# Use `diff()` to find the boundaries between groups of contiguous values
(dat.diff('x') != 0)

# ...prepend a leading 0 (pedantic syntax for xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')

# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])

dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum().rename('group'))
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
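Putting it together, iterating the resulting groups might look like this (the grouper is given an explicit name here, which newer xarray versions require):

```python
import xarray as xr

dat = xr.DataArray([1, 1, 1, 1, 2, 2, 1, 1, 1, 1], dims='x')

# Label contiguous runs: the label increments at each value change.
labels = xr.concat([xr.DataArray(0), dat.diff('x') != 0], 'x').cumsum()

for label, group in dat.groupby(labels.rename('group')):
    print(int(label), group.values.tolist())
```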

The xarray "How do I ..." documentation page could use some recipes like this ("group contiguous values"); I suggest you contact the maintainers and have them added.

Upvotes: 1
