Reputation: 20101
I have a data frame in this form:
value identifier
2007-01-01 0.781611 55
2007-01-01 0.766152 56
2007-01-01 0.766152 57
2007-02-01 0.705615 55
2007-02-01 0.032134 56
2007-02-01 0.032134 57
2008-01-01 0.026512 55
2008-01-01 0.993124 56
2008-01-01 0.993124 57
2008-02-01 0.226420 55
2008-02-01 0.033860 56
2008-02-01 0.033860 57
I can average the values across identifiers for each date using this answer:
by_date = df.groupby(df.index.date)['value'].mean()
2007-01-01 0.771305
2007-02-01 0.256628
2008-01-01 0.670920
2008-02-01 0.098047
Now I want to do a boxplot by month, so I would imagine that I can group by it:
new_df = pd.DataFrame()
new_df['value'] = by_date
by_month = new_df.groupby(lambda x: x.month)
by_month.boxplot(subplots=False)
How can I create this boxplot without the dummy dataframe?
Upvotes: 1
Views: 1040
Reputation: 109546
When you did the groupby on date, you converted the index from a Timestamp to a datetime.date.
>>> type(df.index[0])
pandas.tslib.Timestamp
>>> type(by_date.index[0])
datetime.date
If you convert the index of the original frame to monthly Periods, you can group by month easily.
>>> df.index = pd.DatetimeIndex(df.index).to_period('M')
>>> df.groupby(df.index).value.sum()
2007-01 2.313915
2007-02 0.769883
2008-01 2.012760
2008-02 0.294140
Name: value, dtype: float64
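For anyone who wants to try this end to end, here is a minimal sketch. The frame is rebuilt by hand from the values in the question (the asker's loading code isn't shown, so that part is an assumption), and the names monthly and monthly_sum are just illustrative.

```python
import datetime
import pandas as pd

# Rebuild the question's frame by hand (assumption: the original
# loading code is not shown in the question).
dates = pd.to_datetime(["2007-01-01", "2007-02-01", "2008-01-01", "2008-02-01"])
values = [0.781611, 0.766152, 0.766152,
          0.705615, 0.032134, 0.032134,
          0.026512, 0.993124, 0.993124,
          0.226420, 0.033860, 0.033860]
df = pd.DataFrame(
    {"value": values, "identifier": [55, 56, 57] * 4},
    index=dates.repeat(3),
)

# Grouping on df.index.date swaps the Timestamps for datetime.date objects.
by_date = df.groupby(df.index.date)["value"].mean()
print(type(df.index[0]).__name__, "->", type(by_date.index[0]).__name__)

# Converting the original index to monthly Periods keeps it groupable.
monthly = df.copy()  # copy so the original df keeps its DatetimeIndex
monthly.index = monthly.index.to_period("M")
monthly_sum = monthly.groupby(monthly.index)["value"].sum()
print(monthly_sum)
```

Working on a copy is only a convenience here so that df can be reused afterwards; assigning to df.index directly behaves the same way.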
Upvotes: 1
Reputation: 394041
To make the groupby return a DataFrame instead of a Series, use double subscripting, [[]]:
by_date = df.groupby(df.index.date)[['value']].mean()
This then allows you to group by month and generate a boxplot:
# the index now holds datetime.date objects, so convert it before reading .month
by_month = by_date.groupby(pd.to_datetime(by_date.index).month)
by_month.boxplot(subplots=False)
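Put together, the whole thing might look like the sketch below. The frame is reconstructed by hand from the question's values (an assumption, since the loading code isn't shown), plot_by_month is a hypothetical helper name, and the index is passed through pd.to_datetime because after grouping on df.index.date it holds plain datetime.date objects.

```python
import pandas as pd

# Hand-built copy of the question's data (assumption: not the asker's code).
dates = pd.to_datetime(["2007-01-01", "2007-02-01", "2008-01-01", "2008-02-01"])
values = [0.781611, 0.766152, 0.766152,
          0.705615, 0.032134, 0.032134,
          0.026512, 0.993124, 0.993124,
          0.226420, 0.033860, 0.033860]
df = pd.DataFrame(
    {"value": values, "identifier": [55, 56, 57] * 4},
    index=dates.repeat(3),
)

# Double brackets keep the result a DataFrame, so no dummy frame is needed.
by_date = df.groupby(df.index.date)[["value"]].mean()

# Convert the datetime.date labels back to datetimes to read the month number.
months = pd.to_datetime(by_date.index).month
by_month = by_date.groupby(months)


def plot_by_month(grouped):
    """Hypothetical helper: draw one box per calendar month (needs matplotlib)."""
    return grouped.boxplot(subplots=False)
```

Calling plot_by_month(by_month) then draws one box for January and one for February, each built from the per-date means of that month across both years.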
The use of double subscripting is a subtle feature that is not immediately obvious. Generally, df[col] returns a single column (a Series), but passing a list of columns, df[col_list], returns a DataFrame. Written out, that is the same as df[[col_a, col_b]], which leads to the conclusion that df[[col_a]] also returns a DataFrame: we've passed a list with a single element. This is not the same as df[col_a], where we've passed a label to perform column indexing.
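A quick way to see the difference, using a small made-up frame (col_a and col_b are placeholder names, not columns from the question):

```python
import pandas as pd

# Toy frame purely for illustration.
df = pd.DataFrame({"col_a": [1.0, 2.0], "col_b": [3.0, 4.0]})

single = df["col_a"]    # a label selects one column as a Series
double = df[["col_a"]]  # a one-element list selects a one-column DataFrame

print(type(single).__name__, single.shape)  # Series (2,)
print(type(double).__name__, double.shape)  # DataFrame (2, 1)
```

The same rule carries over to groupby selections, which is why [['value']] in the groupby above yields a DataFrame rather than a Series.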
Upvotes: 2