Reputation: 77
I find the behavior of the groupby method on a DataFrame object unexpected. Let me explain with an example.
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
data1 = df['data1']
data1
# Out[14]:
# 0 1.989430
# 1 -0.250694
# 2 -0.448550
# 3 0.776318
# 4 -1.843558
# Name: data1, dtype: float64
data1 does not have the 'key1' column anymore.
So I would expect to get an error if I applied the following operation:
grouped = data1.groupby(df['key1'])
But I don't, and I can further apply the mean method on grouped to get the expected result.
grouped.mean()
# Out[13]:
# key1
# a -0.034941
# b 0.163884
# Name: data1, dtype: float64
However, the above operation does create a group using the 'key1' column of df.
How can this happen? Does the interpreter store information about the originating DataFrame (df in this case) with the created DataFrame/Series (data1 in this case)?
Thank you.
Upvotes: 1
Views: 175
Reputation: 109546
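Although the grouping keys typically come from the same DataFrame or Series, they don't have to. groupby accepts any array-like of labels with the same length as the object being grouped, so data1 does not need to remember df. Your statement data1.groupby(df['key1']) is equivalent to data1.groupby(['a', 'a', 'b', 'b', 'a']), because df['key1'] shares data1's index and pandas matches a Series key to the grouped object by index label. In fact, you can inspect the actual groups: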
>>> data1.groupby(['a', 'a', 'b', 'b', 'a']).groups
{'a': [0, 1, 4], 'b': [2, 3]}
This means that your groupby on data1 will have a group a using rows 0, 1, and 4 from data1 and a group b using rows 2 and 3.
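To make this concrete, here is a minimal sketch of the behavior described above. The numeric values are fixed stand-ins chosen for illustration (the question used np.random.randn), and the names by_series, by_list, reversed_key, and by_reversed are hypothetical. It compares grouping by the key1 Series with grouping by a plain list of the same labels, then groups by a reordered copy of the key to show that a Series key is matched by index label rather than by position.

```python
import pandas as pd

# Fixed stand-in values for illustration; the question used random data.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
})
data1 = df['data1']

# Grouping by the key Series and by a plain list of the same labels
# produces the same group means.
by_series = data1.groupby(df['key1']).mean()
by_list = data1.groupby(['a', 'a', 'b', 'b', 'a']).mean()

# A Series key is matched to data1 by index label, not by position:
# reversing the row order of the key leaves the result unchanged.
reversed_key = df['key1'].iloc[::-1]
by_reversed = data1.groupby(reversed_key).mean()

print(by_series)
print(by_list)
print(by_reversed)
```

All three results contain the same means for groups a and b. The only visible difference is that the list-based result has an unnamed index, since a plain list carries no name. Nothing in data1 points back to df; the key is simply passed in at call time.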
Upvotes: 0