Andrew Holway

Reputation: 351

Pandas Confusion - My apply() is returning a series and I don't understand why

As part of my ongoing quest to get my head around pandas I am confronted by a surprise series. I don't understand how and why the output is a series - I was expecting a dataframe. If someone could explain what is happening here it would be much appreciated.

ta, Andrew

Some data:

        hash                                         email       date                                           subject  subject_length
0  65319af6e                        [email protected] 2020-11-28      REF-IntervalIndex._assert_can_do_setop-38112              44
1  0bf58d8a9                     [email protected] 2020-11-28  DOC-add-contibutors-to-1.2.0-release-notes-38132              48
2  d16df293c  [email protected] 2020-11-28        TYP-Add-cast-to-ABC-Index-like-types-38043              42
...

Some Code:

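import pandas as pd

def my_function(row):
    # Note: with groupby().apply(), `row` is each month's sub-DataFrame, not a single row
    output = row['email'].value_counts().sort_values(ascending = False).head(3)
    return output

top_three = dataframe.groupby(pd.Grouper(key='date', freq='1M')).apply(my_function)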

Some Output:

date                                                         
2020-01-31  [email protected]                               159
            [email protected]     44
            [email protected]                41
...
2020-10-31  [email protected]                               170
            [email protected]              23
            [email protected]               21
2020-11-30  [email protected]                               134
            [email protected]               36
            [email protected]            19
Name: email, dtype: int64

Upvotes: 0

Views: 54

Answers (1)

Akshay Sehgal

Reputation: 19322

It depends on what the function you pass to groupby().apply() returns.

In your case, the function selects row['email'] and returns its value_counts(), which is a Series indexed by email. apply() then stacks these per-group Series on top of each other, so the result is a single Series with a two-level MultiIndex (date, email) rather than a DataFrame. Calling reset_index() on that result turns the index levels into columns and gives you the DataFrame you were expecting.
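
As a rough sketch of the above (not your actual data): the frame, the example.com addresses, and the names top_emails, month and count are made up for illustration, and I group on df["date"].dt.to_period("M") as a stand-in for your pd.Grouper call. The point is only that a function returning a Series per group yields a MultiIndex Series, which reset_index() flattens into a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the commit log in the question
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-10-05", "2020-10-20", "2020-10-21", "2020-11-03", "2020-11-28"]
    ),
    "email": [
        "a@example.com", "a@example.com", "b@example.com",
        "c@example.com", "c@example.com",
    ],
})

def top_emails(group):
    # `group` is the sub-DataFrame for one month; value_counts() returns a Series
    return group["email"].value_counts().head(3)

# One Series per month, stacked into a single Series with a 2-level index
result = df.groupby(df["date"].dt.to_period("M")).apply(top_emails)
print(type(result).__name__, result.index.nlevels)

# Name the index levels explicitly, then move them into columns
as_frame = result.rename_axis(["month", "email"]).reset_index(name="count")
print(type(as_frame).__name__, list(as_frame.columns))
```

One caveat: if every group happened to return a Series with exactly the same index, pandas would build a DataFrame with those labels as columns instead, so the shape you get really does depend on what the function returns.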


For more clarity on which data structure is returned, we can do a toy experiment.

In the first case, apply() runs the lambda on each group, and each group is a dataframe (check [i for i in df.groupby(['a'])] to see what each group contains).

df = pd.DataFrame({'a':[1,1,2,2,3],  'b':[4,5,6,7,8]})
print(df.groupby(['a']).apply(lambda x:x**2))
#dataframe
   a   b
0  1  16
1  1  25
2  4  36
3  4  49
4  9  64

In the second case, we select a single column before calling apply(), so each group is a series and the lambda returns a series. The combined result is therefore a series, not a dataframe.

print(df.groupby(['a'])['b'].apply(lambda x:x**2))
#series
0    16
1    25
2    36
3    49
4    64
Name: b, dtype: int64

To get a dataframe back instead, select the column with double brackets so that each group is a one-column dataframe:

print(df.groupby(['a'])[['b']].apply(lambda x:x**2))
#dataframe
    b
0  16
1  25
2  36
3  49
4  64
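
If you want to confirm this yourself rather than read it off the printed output, a small check along these lines should do it. It reuses the toy df from above; the variable names are just for illustration, and I only check the types because the exact index layout of the result can differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": [4, 5, 6, 7, 8]})

# Single brackets: every group handed to the function is a Series
series_groups = {type(g).__name__ for _, g in df.groupby("a")["b"]}
# Double brackets: every group handed to the function is a DataFrame
frame_groups = {type(g).__name__ for _, g in df.groupby("a")[["b"]]}
print(series_groups, frame_groups)

squared_series = df.groupby("a")["b"].apply(lambda x: x ** 2)
squared_frame = df.groupby("a")[["b"]].apply(lambda x: x ** 2)
print(type(squared_series).__name__, type(squared_frame).__name__)

# An existing Series result can also be converted after the fact
converted = squared_series.to_frame()
print(type(converted).__name__)
```

So there are two ways out: pick the column with [['b']] before apply(), or convert the series afterwards with to_frame() or reset_index().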

Upvotes: 1
