Reputation: 79
In pandas, it is simple to slice a series (or array) such as [1,1,1,1,2,2,1,1,1,1] into groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:
datagroups= df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())
...where I obtain the individual groups via df[key][variable] == some condition. Groups that satisfy the same condition but are not contiguous become their own groups. If the condition were x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
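For reference, the pandas idiom above can be written out as a small runnable sketch (using x < 2 as the condition and dropping the generic df[key]/variable indirection for brevity):

```python
import pandas as pd

s = pd.Series([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])

# Keep only the values meeting the condition; their original (now
# non-contiguous) index positions reveal the run boundaries.
selected = s[s < 2]

# A jump in the index (diff != 1) starts a new run; cumsum labels the runs.
run_labels = selected.index.to_series().diff().ne(1).cumsum()

groups = [grp.tolist() for _, grp in selected.groupby(run_labels)]
# groups == [[1, 1, 1, 1], [1, 1, 1, 1]]
```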
I am attempting to do the same thing with the xarray package, because I am working with multidimensional data, but the above syntax obviously doesn't work. What I have been able to do so far:
datagroups_notsplit = df[key].where(df[key][variable] == some condition)
So now I have the groups as in the example above, [1,1,1,1,NaN,NaN,1,1,1,1] (if some condition was x < 2). The question is: how do I cut out the NaNs so that this becomes [1,1,1,1], [1,1,1,1]?

I have also tried:
datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])
But then, following the example above, I end up with the groups [1,1,1,1,1,1,1,1], [2,2]. Is there a way to then group these groups by non-contiguous index values?
Upvotes: 1
Views: 512
Reputation: 79
My use case was a bit more complicated than the minimal example I posted, due to the use of timeseries indices and the desire to subselect certain conditions; however, I was able to adapt smci's answer in the following way:
(1) create indexnumber variable:
df = Dataset(
    data_vars={
        'some_data': (('date'), some_data),
        'more_data': (('date'), more_data),
        'indexnumber': (('date'), arange(0, len(date_arr))),
    },
    coords={'date': date_arr},
)
(2) get the indices for the groupby groups:
ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes
(3) get the cumsum field:
sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()
(4) reconstitute a new df:
df2 = df.loc[ind_slice]
(5) add the cumsum field:
df2['sumcum'] = sumcum
(6) groupby:
groups = df2.groupby(df2['sumcum'])
Hope this helps anyone else out there looking to do this.
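Putting the six steps together, here is a self-contained sketch of the recipe under made-up data (ten daily timestamps, more_data < 2 as the condition; .sel() is used in place of .loc for the subselection):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up stand-ins for date_arr / some_data / more_data
date_arr = pd.date_range('2000-01-01', periods=10)
vals = np.array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1])

# (1) dataset with an explicit integer position variable
df = xr.Dataset(
    data_vars={
        'some_data': (('date'), vals),
        'more_data': (('date'), vals),
        'indexnumber': (('date'), np.arange(len(date_arr))),
    },
    coords={'date': date_arr},
)

# (2)-(3) keep rows meeting the condition, then label contiguous runs
# by cumulative-summing the jumps in indexnumber
kept = df.where(df['more_data'] < 2)['indexnumber'].dropna(dim='date')
sumcum = (kept.diff(dim='date') != 1).cumsum()

# (4)-(5) subset to the labelled dates and attach the labels; note that
# diff() drops one sample, so the first kept timestamp falls out here
df2 = df.sel(date=sumcum['date'])
df2['sumcum'] = sumcum

# (6) group by the run labels
groups = {int(k): v['some_data'].values.tolist()
          for k, v in df2.groupby(df2['sumcum'])}
# groups == {0: [1, 1, 1], 1: [1, 1, 1, 1]}
```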
Upvotes: 0
Reputation: 33940
Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just work around the missing pandas functionality with something like:
import pandas as pd
import xarray as xr
dat = xr.DataArray([1,1,1,1,2,2,1,1,1,1], dims='x')
# Use `diff()` to find where contiguous runs of values end
(dat.diff('x') != 0)
# ...prepend a leading 0 (pedantic syntax for xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')
# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum() )
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
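For anyone who wants to iterate the resulting groups, the same recipe condenses into one runnable snippet (the rename('group') gives the grouper an explicit name, since xarray needs a named DataArray to group by):

```python
import xarray as xr

dat = xr.DataArray([1, 1, 1, 1, 2, 2, 1, 1, 1, 1], dims='x')

# Mark where the value changes, prepend a 0 for the first element,
# and take the cumulative sum to label each contiguous run.
labels = xr.concat([xr.DataArray(0), dat.diff('x') != 0], 'x').cumsum()

runs = [grp.values.tolist() for _, grp in dat.groupby(labels.rename('group'))]
# runs == [[1, 1, 1, 1], [2, 2], [1, 1, 1, 1]]
```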
The xarray "How do I ..." documentation page could use some recipes like this ("group contiguous values"); I suggest you contact the maintainers and have one added.
Upvotes: 1