David Hagan

Reputation: 1166

Plotting with GroupBy in Pandas/Python

Although it is straightforward to plot groupby objects in pandas, I am wondering what the most pythonic (pandastic?) way is to grab the unique groups from a groupby object. For example: I am working with atmospheric data and trying to plot diurnal trends over a period of several days or more. The following is the DataFrame containing many days' worth of data, where the timestamp is the index:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10909 entries, 2013-08-04 12:01:00 to 2013-08-13 17:43:00
Data columns (total 17 columns):
Date     10909  non-null values
Flags    10909  non-null values
Time     10909  non-null values
convt    10909  non-null values
hino     10909  non-null values
hinox    10909  non-null values
intt     10909  non-null values
no       10909  non-null values
nox      10909  non-null values
ozonf    10909  non-null values
pmtt     10909  non-null values
pmtv     10909  non-null values
pres     10909  non-null values
rctt     10909  non-null values
smplf    10909  non-null values
stamp    10909  non-null values
no2      10909  non-null values
dtypes: datetime64[ns](1), float64(11), int64(2), object(3)

To average the data (and compute other statistics) for each minute of the day across several days, I group the DataFrame: data = no.groupby('Time')

I can then easily plot the mean NO concentration as well as quartiles:

ax = figure(figsize=(12,8)).add_subplot(111)
title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
ylabel('Concentration [ppb]')
data.no.mean().plot(ax=ax, style='b', label='Mean')
data.no.apply(lambda x: percentile(x, 25)).plot(ax=ax, style='r', label='25%')
data.no.apply(lambda x: percentile(x, 75)).plot(ax=ax, style='r', label='75%')
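The same quartile lines can be computed without a lambda. This is a minimal sketch on synthetic data (the values in `no` below are invented for illustration, not the study data), using the groupby `quantile` method, which with its default linear interpolation should give the same numbers as `percentile`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: three days of one-minute samples.
idx = pd.date_range("2013-08-04", periods=3 * 1440, freq="min")
rng = np.random.default_rng(0)
no = pd.DataFrame({"no": rng.random(len(idx))}, index=idx)
no["Time"] = idx.strftime("%H:%M")  # "HH:MM" strings, as in the question

# One row per minute of the day, one column per statistic.
grouped = no.groupby("Time")["no"]
profile = pd.DataFrame({
    "mean": grouped.mean(),
    "25%": grouped.quantile(0.25),
    "75%": grouped.quantile(0.75),
})
```

Collecting the statistics into one frame keeps them aligned on a shared index, which is convenient for any later plotting call that needs matching x and y values.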

The issue behind my question is that more interesting plots, such as those made with fill_between(), require the x-axis values, per the documentation:

fill_between(x, y1, y2=0, where=None, interpolate=False, hold=None, **kwargs)

For the life of me, I cannot figure out the best way to accomplish this. I have tried:

  1. Iterating over the groupby object and creating an array of the groups
  2. Grabbing all of the unique Time entries from the original DataFrame

I can make these work, but I know there is a better way. Python is far too beautiful. Any ideas/hints?
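For reference, the group labels can be read either from the groupby object or from the index of any aggregated result. This is a small sketch with a made-up frame (`df` and its values are hypothetical):

```python
import pandas as pd

# Tiny made-up frame with repeated "HH:MM" labels.
df = pd.DataFrame({
    "Time": ["00:00", "00:01", "00:00", "00:01", "00:02"],
    "no": [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped = df.groupby("Time")

# Option 1: the keys of the .groups mapping are the group labels.
labels_from_groups = list(grouped.groups)

# Option 2: an aggregated result is indexed by the same labels,
# sorted by default, so its index can serve directly as the x values.
means = grouped["no"].mean()
labels_from_index = list(means.index)
```

The second option is usually the more convenient one for plotting, because the labels stay paired with the statistics they describe.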

UPDATE: The statistics can be collected into a new DataFrame using describe() and unstack(), for example:

no_new = no.groupby('Time')['no'].describe().unstack()
no_new.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1440 entries, 00:00 to 23:59
Data columns (total 8 columns):
count    1440  non-null values
mean     1440  non-null values
std      1440  non-null values
min      1440  non-null values
25%      1440  non-null values
50%      1440  non-null values
75%      1440  non-null values
max      1440  non-null values
dtypes: float64(8)

Although I should be able to plot with fill_between() using no_new.index, I receive a TypeError.

Current Plot code and TypeError:

ax = figure(figsize=(12,8)).add_subplot(111)
ax.plot(no_new['mean'])
ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5, facecolor='green')

TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-47493de920f1> in <module>()
      2 ax = figure(figsize=(12,8)).add_subplot(111)
      3 ax.plot(no_new['mean'])
----> 4 ax.fill_between(no_new.index, no_new['mean'], no_new['75%'], alpha=.5,     facecolor='green')
      5 #title('Diurnal Profile for NO, NO2, and NOx: East St. Louis Air Quality Study')
      6 #ylabel('Concentration [ppb]')

C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\matplotlib\axes.pyc in fill_between(self, x, y1, y2, where, interpolate, **kwargs)
   6986 
   6987         # Convert the arrays so we can work with them
-> 6988         x = ma.masked_invalid(self.convert_xunits(x))
   6989         y1 = ma.masked_invalid(self.convert_yunits(y1))
   6990         y2 = ma.masked_invalid(self.convert_yunits(y2))

C:\Users\David\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\ma\core.pyc in masked_invalid(a, copy)
   2237         cls = type(a)
   2238     else:
-> 2239         condition = ~(np.isfinite(a))
   2240         cls = MaskedArray
   2241     result = a.view(cls)

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
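The traceback ends in `np.isfinite`, which only accepts numeric input. A likely reading is that the index of "HH:MM" strings arrives as an object-dtype array. This sketch reproduces that failure in isolation, using invented values:

```python
import numpy as np

# An index of "HH:MM" strings reaches NumPy as an object-dtype array.
x = np.array(["00:00", "00:01", "00:02"], dtype=object)

try:
    np.isfinite(x)
    raised = False
except TypeError:
    # Same error class as in the traceback above.
    raised = True

# Numeric x values pass the same check without complaint.
minutes = np.array([0, 1, 2])
all_finite = bool(np.isfinite(minutes).all())
```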

The plot currently looks like this: (image not included)

Upvotes: 4

Views: 8133

Answers (1)

Garrett

Reputation: 49886

Storing the groupby stats (mean, 25%, 75%) as columns in a new DataFrame and then passing that DataFrame's index as the x parameter of plt.fill_between() works for me (tested with matplotlib 1.3.1). For example:

gdf = df.groupby('Time')[col].describe().unstack()
plt.fill_between(gdf.index, gdf['25%'], gdf['75%'], alpha=.5)

gdf.info() should look like this:

<class 'pandas.core.frame.DataFrame'>
Index: 12 entries, 00:00:00 to 22:00:00
Data columns (total 8 columns):
count    12 non-null float64
mean     12 non-null float64
std      12 non-null float64
min      12 non-null float64
25%      12 non-null float64
50%      12 non-null float64
75%      12 non-null float64
max      12 non-null float64
dtypes: float64(8)
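One version note, offered as a hedged aside: in recent pandas releases, `describe()` on a grouped column already returns one row per group with the statistics as columns, so the trailing `unstack()` should not be needed there. A sketch with a made-up frame (`df` and its values are hypothetical):

```python
import pandas as pd

# Made-up data: two samples for each of two time labels.
df = pd.DataFrame({
    "Time": ["00:00", "00:00", "01:00", "01:00"],
    "no": [1.0, 3.0, 2.0, 6.0],
})

# In recent pandas this is already a wide frame: one row per Time value,
# with count/mean/std/min/25%/50%/75%/max as columns.
gdf = df.groupby("Time")["no"].describe()
```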

Update: to address the TypeError: ufunc 'isfinite' not supported exception, it is necessary to first convert the Time column from a series of string objects in "HH:MM" format to a series of datetime.time objects, which can be done as follows:

df['Time'] = df.Time.map(lambda x: pd.datetools.parse(x).time())
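`pd.datetools` is not available in recent pandas releases, so the line above may not run as written there. As an alternative sketch (a swapped-in approach, not the method tested above), the "HH:MM" strings can be parsed with `pd.to_datetime` and converted to minutes since midnight, which gives `fill_between()` plain numeric x values. The frame `gdf` below is invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stats frame indexed by "HH:MM" strings, like no_new / gdf.
gdf = pd.DataFrame(
    {"25%": [1.0, 1.5, 1.2], "mean": [2.0, 2.5, 2.2], "75%": [3.0, 3.5, 3.2]},
    index=["00:00", "00:01", "00:02"],
)

# Parse the strings, then express each time as minutes since midnight.
parsed = pd.to_datetime(gdf.index, format="%H:%M")
minutes = (parsed.hour * 60 + parsed.minute).to_numpy()

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(minutes, gdf["mean"].to_numpy(), "b", label="Mean")
# Shade the interquartile band between the 25% and 75% columns.
ax.fill_between(minutes, gdf["25%"].to_numpy(), gdf["75%"].to_numpy(),
                alpha=0.5, facecolor="green")
ax.set_xlabel("Minutes since midnight")
```

Numeric minutes avoid any dependence on how a given matplotlib version handles time objects, at the cost of needing custom tick labels if clock times are wanted on the axis.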

Upvotes: 5
