Reputation: 65
Does anyone know why pandas behaves differently when the column used as the `by` key in groupby contains only one unique value? Specifically, if there is just one unique value and the applied function returns a pandas.Series, the output is basically transposed compared with the case of multiple unique values:
import pandas as pd

dt = pd.date_range('2021-01-01', '2021-01-02 23:00', closed=None, freq='1H')
df = pd.DataFrame({'date':dt.date, 'vals': range(dt.shape[0])}, index=dt)
dt1 = pd.date_range('2021-01-01', '2021-01-01 23:00', closed=None, freq='1H')
df2 = pd.DataFrame({'date':dt1.date, 'vals': range(dt1.shape[0])}, index=dt1)
def f(row):
    return row['vals']
print(df.groupby('date').apply(f).shape)
print(df2.groupby('date').apply(f).shape)
[out 1] (48,)
[out 2] (1, 24)
Is there some simple parameter I can use to make sure the behavior is consistent? Would it make sense to submit this as a bug report because of the inconsistency, or is it "expected"? (I understood from a previous question that poor design of a small part is not necessarily a bug.) I still love pandas; it's just that these small things can make using it very painful.
Upvotes: 2
Views: 211
Reputation: 41407
squeeze()
DataFrame.squeeze() and Series.squeeze() can make the shapes consistent:
>>> df.groupby('date').apply(f).squeeze().shape
(48,)
>>> df2.groupby('date').apply(f).squeeze().shape
(24,)
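To see why this works, here is a minimal sketch of what squeeze() does on its own, independent of groupby. The frames `wide`, `flat`, and `single` are made up for illustration. One caveat worth knowing: a bare squeeze() on a 1x1 DataFrame collapses all the way to a scalar, so passing `axis=0` is safer if only the length-1 row axis should be dropped.

```python
import pandas as pd

# Illustrative data only: a one-row frame like the single-date result,
# and a plain Series like the multi-date result.
wide = pd.DataFrame([list(range(24))])   # shape (1, 24)
flat = pd.Series(range(48))              # shape (48,)

print(wide.squeeze().shape)   # (24,)  the length-1 row axis is dropped
print(flat.squeeze().shape)   # (48,)  nothing to drop, returned unchanged

# Edge case: a 1x1 frame squeezes down to a scalar unless an axis is given.
single = pd.DataFrame([[7]])
scalar = single.squeeze()            # a plain scalar, not a Series
row = single.squeeze(axis=0)         # Series of length 1
print(row.shape)              # (1,)
```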
squeeze=True (deprecated)
groupby() has a squeeze param:
squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
>>> df.groupby('date', squeeze=True).apply(f).shape
(48,)
>>> df2.groupby('date', squeeze=True).apply(f).shape
(24,)
This has been deprecated since pandas 1.1.0 and was removed in pandas 2.0.
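Because the squeeze parameter cannot be relied on going forward, one option is to wrap the groupby call and squeeze the result afterwards. The sketch below is only an illustration: `apply_flat` is a hypothetical helper name, not part of pandas, and it assumes the only inconsistency to handle is a DataFrame result with a single row.

```python
import pandas as pd

def apply_flat(frame, by, func):
    """Hypothetical helper: groupby-apply, then drop a length-1 row axis."""
    result = frame.groupby(by).apply(func)
    # Only squeeze the row axis, so a one-group result becomes a Series
    # while anything already one-dimensional passes through untouched.
    if isinstance(result, pd.DataFrame) and result.shape[0] == 1:
        result = result.squeeze(axis=0)
    return result

# One day of hourly data, so the 'date' column has a single unique value.
idx = pd.date_range("2021-01-01", periods=24, freq=pd.Timedelta(hours=1))
one_day = pd.DataFrame({"date": idx.date, "vals": range(len(idx))}, index=idx)

flat = apply_flat(one_day, "date", lambda group: group["vals"])
print(flat.shape)
```

Note that the index of the squeezed result can differ from the multi-date case (timestamps only rather than a date/timestamp pair), so this makes the shapes consistent, not necessarily the index.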
Upvotes: 2