Reputation: 5117
I try to groupby and agg, but I receive an empty dataframe with no error.
When I do this:
df_temp = df.groupby('Col1')[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive the dataframe aggregated as expected.
When I do this:
df_temp = df.groupby(['Col1', 'Col2'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive the dataframe aggregated as expected.
When I do this:
df_temp = df.groupby(['Col1', 'Col2', 'Col3'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive the dataframe aggregated as expected.
But when I do this:
df_temp = df.groupby(['Col1', 'Col2', 'Col3', 'Col4'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive an empty dataframe and no error.
However, I do not think that the problem is Col4, because when I remove Col2 but keep Col4, I again receive the dataframe aggregated as expected.
Why is this happening?
'Col1', 'Col2', 'Col3' and 'Col4' are of different types, but I do not think that this is the problem, because 'Col1', 'Col2' and 'Col3' are also of different types, yet the aggregation works when I group by only these.
Can it be related to NAs in these columns?
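One quick way to check is to count missing values in the grouping columns before calling groupby. A small sketch (the dataframe below is a hypothetical stand-in, since the real data is not shown; the column names are taken from the question):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the real data.
df = pd.DataFrame({
    'Col1': ['a', 'a', 'b'],
    'Col2': [1.0, np.nan, 2.0],
    'Col3': ['x', 'y', None],
    'Col4': [np.nan, 'u', 'v'],
})

key_cols = ['Col1', 'Col2', 'Col3', 'Col4']

# NAs per grouping column.
na_per_col = df[key_cols].isna().sum()

# Rows whose group key contains at least one NA; groupby drops these.
rows_with_na_key = df[key_cols].isna().any(axis=1).sum()
print(na_per_col)
print(rows_with_na_key)
```

If every row has an NA in at least one key column, the result of grouping by all four columns is empty.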
P.S.
I know that it would be better to include specific examples of my data, but posting them here would be too time-consuming, and I do not want to expose my data at all.
P.S.2
I did the following: before the groupby, I filled in the np.nan values (e.g. -1 for floats and 'NA' for objects) and the code worked, so my initial hypothesis about the NAs was probably right. Feel free to share ideas on why this happens.
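For reference, the fill-before-grouping workaround described above can be sketched like this (column names and the sentinel value -1 are assumptions for illustration):

```python
import pandas as pd
import numpy as np

# Both rows have NaN in a key column, so grouping drops everything.
df = pd.DataFrame({
    'Col1': ['a', 'b'],
    'Col2': [np.nan, np.nan],
    'InfoType': ['t1', 't2'],
})

before = df.groupby(['Col1', 'Col2'])['InfoType'].agg(', '.join)
print(before.empty)  # True: every group key contains NaN

# Fill a sentinel into the key column before grouping.
filled = df.fillna({'Col2': -1})
after = filled.groupby(['Col1', 'Col2'])['InfoType'].agg(', '.join)
print(after)  # two groups, none excluded
```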
Upvotes: 6
Views: 14729
Reputation: 30579
The reason is that every group formed by all 4 columns contains at least one NA value in its keys. These groups are therefore excluded, and the result is empty. With fewer than 4 columns this condition is evidently not met by your actual data.
See the docs on missing values:
NA groups in GroupBy are automatically excluded.
Example:
>>> df = pd.DataFrame({'a':[None,1,2], 'b':[1,None,2], 'c': [1,2,None], 'd': [1,1,1]})
>>> df
a b c d
0 NaN 1.0 1.0 1
1 1.0 NaN 2.0 1
2 2.0 2.0 NaN 1
>>> df.groupby(['a', 'b']).d.sum()
a b
2.0 2.0 1
Name: d, dtype: int64
>>> df.groupby(['a', 'c']).d.sum()
a c
1.0 2.0 1
Name: d, dtype: int64
>>> df.groupby(['b', 'c']).d.sum()
b c
1.0 1.0 1
Name: d, dtype: int64
>>> df.groupby(['a', 'b', 'c']).d.sum()
Series([], Name: d, dtype: int64)
Version 1.1.0 will have a dropna parameter in groupby to handle this kind of case. You can set it to False to include NA values in the groupby keys (the default is True for backward compatibility); see https://github.com/pandas-dev/pandas/pull/30584.
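With pandas >= 1.1.0, the example above keeps the NA groups when dropna=False is passed:

```python
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2], 'b': [1, None, 2],
                   'c': [1, 2, None], 'd': [1, 1, 1]})

# Default (dropna=True): every row's key contains NaN, result is empty.
dropped = df.groupby(['a', 'b', 'c']).d.sum()
print(dropped.empty)  # True

# dropna=False treats NaN as a valid group key (requires pandas >= 1.1.0).
kept = df.groupby(['a', 'b', 'c'], dropna=False).d.sum()
print(kept)  # one group per row, each summing to 1
```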
Upvotes: 11