Reputation: 5117
I try to groupby and agg, but I receive an empty dataframe with no error.
When I do this:
df_temp = df.groupby('Col1')[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive the dataframe aggregated as expected.
When I do this:
df_temp = df.groupby(['Col1', 'Col2'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive the dataframe aggregated as expected.
When I do this:
df_temp = df.groupby(['Col1', 'Col2', 'Col3'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive the dataframe aggregated as expected.
But when I do this:
df_temp = df.groupby(['Col1', 'Col2', 'Col3', 'Col4'])[['InfoType', 'InfoLabel1', 'InfoLabel2']].agg(lambda x: ', '.join(x))
then I receive an empty dataframe and no error.
However, I do not think that the problem is Col4, because when I remove Col2 but keep Col4, I again receive the dataframe aggregated as expected.
Why is this happening?
'Col1', 'Col2', 'Col3' and 'Col4' are of different types, but I do not think that this is the problem, because 'Col1', 'Col2' and 'Col3' are also of different types, yet the aggregation works when I group by only these.
Can it be related to NAs in these columns?
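One quick way to check is to count missing values in the grouping columns before calling groupby. A small sketch (the dataframe below is a hypothetical stand-in, since the real data is not shown; the column names are taken from the question):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the real data.
df = pd.DataFrame({
    'Col1': ['a', 'a', 'b'],
    'Col2': [1.0, np.nan, 2.0],
    'Col3': ['x', 'y', None],
    'Col4': [np.nan, 'u', 'v'],
})

key_cols = ['Col1', 'Col2', 'Col3', 'Col4']

# NAs per grouping column.
na_per_col = df[key_cols].isna().sum()

# Rows whose group key contains at least one NA; groupby drops these.
rows_with_na_key = df[key_cols].isna().any(axis=1).sum()
print(na_per_col)
print(rows_with_na_key)
```

If every row has an NA in at least one key column, the result of grouping by all four columns is empty.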
P.S.
I know that it would be better to include specific examples of my data, but posting them here would be too time-consuming, and I do not want to expose my data at all.
P.S.2
I did the following: before the groupby, I filled in the np.nan values (e.g. -1 for floats and 'NA' for objects) and the code worked, so my initial hypothesis about the NAs was probably right. Feel free to share ideas on why this happens.
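For reference, the fill-before-grouping workaround described above can be sketched like this (column names and the sentinel value -1 are assumptions for illustration):

```python
import pandas as pd
import numpy as np

# Both rows have NaN in a key column, so grouping drops everything.
df = pd.DataFrame({
    'Col1': ['a', 'b'],
    'Col2': [np.nan, np.nan],
    'InfoType': ['t1', 't2'],
})

before = df.groupby(['Col1', 'Col2'])['InfoType'].agg(', '.join)
print(before.empty)  # True: every group key contains NaN

# Fill a sentinel into the key column before grouping.
filled = df.fillna({'Col2': -1})
after = filled.groupby(['Col1', 'Col2'])['InfoType'].agg(', '.join)
print(after)  # two groups, none excluded
```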
Upvotes: 6
Views: 14729
Reputation: 30579
The reason is that every group formed by all 4 columns contains at least one NA value in its keys. These groups are therefore excluded, and the result is empty. With fewer than 4 columns this condition is evidently not met by your actual data.
See the docs on missing values:
NA groups in GroupBy are automatically excluded.
Example:
>>> df = pd.DataFrame({'a':[None,1,2], 'b':[1,None,2], 'c': [1,2,None], 'd': [1,1,1]})
>>> df
a b c d
0 NaN 1.0 1.0 1
1 1.0 NaN 2.0 1
2 2.0 2.0 NaN 1
>>> df.groupby(['a', 'b']).d.sum()
a b
2.0 2.0 1
Name: d, dtype: int64
>>> df.groupby(['a', 'c']).d.sum()
a c
1.0 2.0 1
Name: d, dtype: int64
>>> df.groupby(['b', 'c']).d.sum()
b c
1.0 1.0 1
Name: d, dtype: int64
>>> df.groupby(['a', 'b', 'c']).d.sum()
Series([], Name: d, dtype: int64)
Version 1.1.0 will have a dropna parameter in groupby to handle this kind of case. You can set it to False to include NA values in the groupby keys (the default is True for backward compatibility); see https://github.com/pandas-dev/pandas/pull/30584.
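With pandas >= 1.1.0, the example above keeps the NA groups when dropna=False is passed:

```python
import pandas as pd

df = pd.DataFrame({'a': [None, 1, 2], 'b': [1, None, 2],
                   'c': [1, 2, None], 'd': [1, 1, 1]})

# Default (dropna=True): every row's key contains NaN, result is empty.
dropped = df.groupby(['a', 'b', 'c']).d.sum()
print(dropped.empty)  # True

# dropna=False treats NaN as a valid group key (requires pandas >= 1.1.0).
kept = df.groupby(['a', 'b', 'c'], dropna=False).d.sum()
print(kept)  # one group per row, each summing to 1
```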
Upvotes: 11