Reputation: 351
As part of my ongoing quest to get my head around pandas I am confronted by a surprise series. I don't understand how and why the output is a series - I was expecting a dataframe. If someone could explain what is happening here it would be much appreciated.
ta, Andrew
Some data:
        hash            email        date                                           subject  subject_length
0  65319af6e  [email protected]  2020-11-28      REF-IntervalIndex._assert_can_do_setop-38112              44
1  0bf58d8a9  [email protected]  2020-11-28  DOC-add-contibutors-to-1.2.0-release-notes-38132              48
2  d16df293c  [email protected]  2020-11-28        TYP-Add-cast-to-ABC-Index-like-types-38043              42
...
Some Code:
def my_function(row):
    output = row['email'].value_counts().sort_values(ascending=False).head(3)
    return output
top_three = dataframe.groupby(pd.Grouper(key='date', freq='1M')).apply(my_function)
Some Output:
date
2020-01-31  [email protected]    159
            [email protected]     44
            [email protected]     41
...
2020-10-31  [email protected]    170
            [email protected]     23
            [email protected]     21
2020-11-30  [email protected]    134
            [email protected]     36
            [email protected]     19
Name: email, dtype: int64
Upvotes: 0
Views: 54
Reputation: 19322
It depends on what your groupby is returning.
In your case, the function works on row['email'] and returns a single value_counts() result for each group, which is a Series indexed by email. pandas stacks those per-group results under the group key, so date and email both end up in the index and only the counts remain as values. In other words, you get a single column of values with a multi-index, which comes back as a Series instead of a DataFrame. A reset_index() would therefore give you what you need.
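As a rough sketch of the point above, here is a small reproduction. The data, and the names commits, top_emails and table, are made up for illustration. I also group on dt.to_period("M") instead of pd.Grouper(freq='1M'), only to avoid differences in frequency aliases between pandas versions. The rename/rename_axis step is there because the default labels that value_counts() produces differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy data standing in for the commit log in the question.
commits = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-05", "2020-10-09", "2020-10-20",
                            "2020-11-02", "2020-11-15", "2020-11-28"]),
    "email": ["a@x.org", "a@x.org", "b@x.org",
              "c@x.org", "c@x.org", "a@x.org"],
})

def top_emails(group):
    # value_counts() returns a Series sorted by count, largest first
    return group["email"].value_counts().head(3)

# Group by calendar month (swapped in for pd.Grouper in the question).
month = commits["date"].dt.to_period("M")
top_three = commits.groupby(month).apply(top_emails)

print(type(top_three).__name__)   # Series
print(top_three.index.nlevels)    # 2 (month, email)

# Name the values and the index levels explicitly, then move the
# index levels into ordinary columns to get a DataFrame.
table = (top_three.rename("count")
                  .rename_axis(["date", "email"])
                  .reset_index())
print(table)
```

The per-group results have different email labels, so pandas cannot line them up as columns and stacks them into one long Series instead.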
For more clarity on which data structure is returned, we can do a toy experiment.
For example, in the first case, apply runs the lambda function on each group, and each group is a dataframe (check [i for i in df.groupby(['a'])] to see what each group contains).
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 3], 'b': [4, 5, 6, 7, 8]})
print(df.groupby(['a']).apply(lambda x: x**2))
# dataframe
   a   b
0  1  16
1  1  25
2  4  36
3  4  49
4  9  64
For the second case, we are only applying the lambda function on a series object OR only a single series is being returned. In this case, it doesn't return a dataframe and instead returns a series.
print(df.groupby(['a'])['b'].apply(lambda x: x**2))
# series
0    16
1    25
2    36
3    49
4    64
Name: b, dtype: int64
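To see the difference between the two cases without relying on the printed output, you can look at the type of each group directly. This is just a sketch on the same toy frame. The names frame_group_types and series_group_types are mine.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 3], 'b': [4, 5, 6, 7, 8]})

# Grouping the whole frame: every group handed to apply is a DataFrame.
frame_group_types = {type(group).__name__ for _, group in df.groupby('a')}

# Selecting one column first: every group is a Series.
series_group_types = {type(group).__name__ for _, group in df.groupby('a')['b']}

print(frame_group_types)   # {'DataFrame'}
print(series_group_types)  # {'Series'}
```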
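This can be solved simply by selecting the column with double brackets, so that each group is a one-column dataframe rather than a series: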
print(df.groupby(['a'])[['b']].apply(lambda x: x**2))
# dataframe
    b
0  16
1  25
2  36
3  49
4  64
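If you want to check the result types rather than eyeball the printout, something like the following should work. It is a sketch on the same toy frame. I pass group_keys=False so the original row index is kept (how group keys are added to the index differs between pandas versions), and to_frame() is shown as an alternative I am adding here, not something from the snippets above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 3], 'b': [4, 5, 6, 7, 8]})

# group_keys=False keeps the original row index in the result.
grouped = df.groupby('a', group_keys=False)

as_series = grouped['b'].apply(lambda x: x**2)    # single brackets
as_frame = grouped[['b']].apply(lambda x: x**2)   # double brackets

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame

# A Series you already have can also be converted after the fact.
also_frame = as_series.to_frame()
print(type(also_frame).__name__)  # DataFrame
```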
Upvotes: 1