Aggregate by repeated datetime index with different identifiers in a column on a pandas dataframe

Question

I have a data frame in this form:

               value  identifier
2007-01-01  0.781611          55
2007-01-01  0.766152          56
2007-01-01  0.766152          57
2007-02-01  0.705615          55
2007-02-01  0.032134          56
2007-02-01  0.032134          57
2008-01-01  0.026512          55
2008-01-01  0.993124          56
2008-01-01  0.993124          57
2008-02-01  0.226420          55
2008-02-01  0.033860          56
2008-02-01  0.033860          57

I can average the values across identifiers for each date, following this answer:

by_date = df.groupby(df.index.date)['value'].mean()
2007-01-01    0.771305
2007-02-01    0.256628
2008-01-01    0.670920
2008-02-01    0.098047

Now I want to do a boxplot by month, so I would imagine that I can group by it:

new_df = pd.DataFrame()
new_df['value'] = by_date
aa = new_df.groupby(lambda x: x.month)
aa.boxplot(subplots=False)

How can I create this boxplot without the dummy dataframe?

EdChum · Accepted Answer

For the groupby to return a DataFrame instead of a Series, use double subscripting [[]]:

by_date = df.groupby(df.index.date)[['value']].mean()

This then allows you to group by month and generate a boxplot:

# the index holds datetime.date objects, so convert it before taking .month
by_month = by_date.groupby(pd.to_datetime(by_date.index).month)
by_month.boxplot(subplots=False)
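
For reference, here is a self-contained sketch of the two steps on the sample data from the question. The way the frame is built, the pd.to_datetime conversion of the grouped index, and the helper name plot_by_month are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

# Rebuild the sample frame from the question (assumed layout: a DatetimeIndex
# with one row per date and identifier).
dates = pd.to_datetime(["2007-01-01", "2007-02-01", "2008-01-01", "2008-02-01"])
df = pd.DataFrame(
    {
        "value": [0.781611, 0.766152, 0.766152,
                  0.705615, 0.032134, 0.032134,
                  0.026512, 0.993124, 0.993124,
                  0.226420, 0.033860, 0.033860],
        "identifier": [55, 56, 57] * 4,
    },
    index=dates.repeat(3),
)

# Double brackets keep the aggregated result a one-column DataFrame.
by_date = df.groupby(df.index.date)[["value"]].mean()

# The grouped index holds datetime.date objects, so convert it back to
# datetimes before reading the month numbers.
months = pd.to_datetime(by_date.index).month
by_month = by_date.groupby(months)


def plot_by_month(grouped):
    """Hypothetical helper: draw one box per month (requires matplotlib)."""
    return grouped.boxplot(subplots=False)
```

Calling plot_by_month(by_month) should draw the boxplot directly from the grouped frame, with no intermediate dummy DataFrame.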

Double subscripting is a subtle feature that is not immediately obvious. Generally, df[col] returns a column (a Series), whereas passing a list of columns, df[col_list], returns a DataFrame; written out, that is df[[col_a, col_b]]. It follows that df[[col_a]] also returns a DataFrame, because we have passed a list with a single element. This is not the same as df[col_a], where we have passed a single label to perform column indexing.
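
A quick way to see the difference is to check the types directly. This is a minimal sketch on made-up toy data; the column names col_a and col_b follow the placeholders used above and the values are arbitrary.

```python
import pandas as pd

# Toy frame, assumed purely for illustration.
df = pd.DataFrame({"col_a": [1.0, 2.0, 3.0], "col_b": ["x", "x", "y"]})

single = df["col_a"]      # a label selects one column -> Series
double = df[["col_a"]]    # a one-element list -> DataFrame

# The same rule applies to column selection on a groupby object.
grouped_single = df.groupby("col_b")["col_a"].mean()    # Series
grouped_double = df.groupby("col_b")[["col_a"]].mean()  # DataFrame

for name, obj in [("df['col_a']", single), ("df[['col_a']]", double),
                  ("groupby ['col_a']", grouped_single),
                  ("groupby [['col_a']]", grouped_double)]:
    print(name, "->", type(obj).__name__)
```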
