Jakub Cieslar

Reputation: 65

Unexpected output format in pandas.groupby.apply

Does anyone know why pandas behaves differently when the column used as the `by` key in `groupby` contains only one unique value? Specifically, if there is just one unique value and the applied function returns a pandas.Series, the output is essentially transposed compared with the case of multiple unique values:

import pandas as pd

# Two days of hourly data: 'date' has 2 unique values
dt = pd.date_range('2021-01-01', '2021-01-02 23:00', freq='1H')
df = pd.DataFrame({'date': dt.date, 'vals': range(dt.shape[0])}, index=dt)
# One day of hourly data: 'date' has only 1 unique value
dt1 = pd.date_range('2021-01-01', '2021-01-01 23:00', freq='1H')
df2 = pd.DataFrame({'date': dt1.date, 'vals': range(dt1.shape[0])}, index=dt1)
def f(group):
    # Return the 'vals' column of each group as a Series
    return group['vals']
print(df.groupby('date').apply(f).shape)
print(df2.groupby('date').apply(f).shape)
[out 1] (48,)
[out 2] (1, 24)

Is there a simple parameter I can use to make the behavior consistent? Would it make sense to submit this as a bug report because of the inconsistency, or is it "expected"? (I understood from a previous question that poor design in a small part of the library is sometimes not considered a bug.) I still love pandas, but these small things can make it very painful to use.

Upvotes: 2

Views: 211

Answers (1)

tdy

Reputation: 41407

squeeze()

DataFrame.squeeze() and Series.squeeze() can make the shapes consistent:

>>> df.groupby('date').apply(f).squeeze().shape
(48,)

>>> df2.groupby('date').apply(f).squeeze().shape
(24,)
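
To see what squeeze() does in isolation, here is a minimal sketch that does not depend on groupby at all. The objects `wide`, `flat`, and `tiny` are hypothetical stand-ins I made up to mimic the two result shapes from the question; they are not the actual groupby output.

```python
import pandas as pd

# Hypothetical stand-ins for the two shapes seen in the question.
wide = pd.DataFrame([list(range(24))], index=["2021-01-01"])  # shape (1, 24)
flat = pd.Series(range(48))                                   # shape (48,)

# A single-row DataFrame collapses to a Series; a longer Series is unchanged.
wide_squeezed = wide.squeeze()
flat_squeezed = flat.squeeze()
print(wide_squeezed.shape)
print(flat_squeezed.shape)

# Edge case: a 1x1 frame squeezes all the way down to a scalar.
# Passing axis=0 squeezes only the row axis and keeps a Series.
tiny = pd.DataFrame([[7]])
tiny_scalar = tiny.squeeze()
tiny_series = tiny.squeeze(axis=0)
print(type(tiny_scalar))
print(tiny_series.shape)
```

One thing to watch for: if the single group could ever contain just one row, a bare squeeze() would return a scalar rather than a Series, so squeeze(axis=0) may be the safer choice there.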

squeeze=True (deprecated)

groupby() has a squeeze param:

squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type.

>>> df.groupby('date', squeeze=True).apply(f).shape
(48,)

>>> df2.groupby('date', squeeze=True).apply(f).shape
(24,)

This parameter has been deprecated since pandas 1.1.0 and was removed in pandas 2.0, so prefer calling squeeze() on the result instead.

Upvotes: 2
