Norfeldt
Norfeldt

Reputation: 9638

How to plot data in pandas by date and perform grouping at the same time

I often find myself wanting to plot data in one column by another, but find it hard to group/separate them by a 3th column.

Suppose I have a sheet like this

enter image description here

How would I create the same plot in pandas?

BTW: I like that the x-axis is linear and not just group of dates aligned next to each other since it gives and idea about how close the measurements are made to each other within a group - but would be nice to know how to do both in case the distance is too far.

UPDATE

The answer from @Ffisegydd has been very useful. However I was a bit too fast accepting the answer - I discovered that when trying out the code on an actual excel sheet. The problem is entirely my fault since I didn't provide an Excel sheet. @Ffisegydd was so kind to manually create the dataframe from my question, but working with excel files is a bit different.

I do apoligize. Here is an Excel file: https://dl.dropboxusercontent.com/u/3216968/Example.xlsx

This is how far I got (in IPython notebook)

import pandas as pd
import datetime as dt

path2file = r"C:\Example.xlsx"
_xl = pd.ExcelFile(path2file)
df = pd.read_excel(path2file, _xl.sheet_names[0], header=0)
df

enter image description here

df.Date = df.Date.apply( lambda x: dt.datetime.strptime(x, '%Y.%m.%d').date() )
df

enter image description here

Here is where things go wrong:

pd.DataFrame( data= [df.Data, df.Group], columns = ['Data', 'Group'], index=df.Date)

Giving this error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-231baa928f67> in <module>()
----> 1 pd.DataFrame( data= [df.Data, df.Group], columns = ['Data', 'Group'], index=df.Date)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
    245                         index = _default_index(len(data))
    246                     mgr = _arrays_to_mgr(arrays, columns, index, columns,
--> 247                                          dtype=dtype)
    248                 else:
    249                     mgr = self._init_ndarray(data, index, columns, dtype=dtype,

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   4471     axes = [_ensure_index(columns), _ensure_index(index)]
   4472 
-> 4473     return create_block_manager_from_arrays(arrays, arr_names, axes)
   4474 
   4475 

C:\Python27\lib\site-packages\pandas\core\internals.pyc in create_block_manager_from_arrays(arrays, names, axes)
   3757         return mgr
   3758     except (ValueError) as e:
-> 3759         construction_error(len(arrays), arrays[0].shape[1:], axes, e)
   3760 
   3761 

C:\Python27\lib\site-packages\pandas\core\internals.pyc in construction_error(tot_items, block_shape, axes, e)
   3729         raise e
   3730     raise ValueError("Shape of passed values is {0}, indices imply {1}".format(
-> 3731         passed,implied))
   3732 
   3733 def create_block_manager_from_blocks(blocks, axes):

ValueError: Shape of passed values is (2,), indices imply (2, 12)

or doing this

pd.DataFrame( {'data': df.Data, 'group': df.Group}, index=df.Date)

enter image description here

Upvotes: 3

Views: 6848

Answers (2)

Ffisegydd
Ffisegydd

Reputation: 53668

You can create a groupby object and then iterate over the groups and plot.

Below is some code which takes your data and plots the two "groups". There is some extra formatting to make the graph look good as well.

import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt

path2file = r"Example.xlsx"
_xl = pd.ExcelFile(path2file)
df = pd.read_excel(path2file, _xl.sheet_names[0], header=0) 

df.Date = df.Date.apply( lambda x: dt.datetime.strptime(x, '%Y.%m.%d').date())
df.index = df.Date # Set the Date column as your index
del df['Date'] # Remove the Date column from your data columns

grouped = df.groupby('Group') # groupby object

# Normally you would just iterate using "for k, g in grouped:" but the i
# is necessary for selecting a color.
colors = ['red', 'blue']
for i, (k, g) in enumerate(grouped):
    plt.plot_date(g['Data'].index, g['Data'], linestyle='None', marker='o', mfc=colors[i], label=k)

plt.legend()
plt.gcf().autofmt_xdate() # Format the dates with a diagonal slant to make them fit.

# Pad the data out so all markers can be seen.
pad = dt.timedelta(days=7)
plt.xlim((min(df.index)-pad, max(df.index)+pad))
plt.ylim(0,6)

Plot

Upvotes: 4

user1827356
user1827356

Reputation: 7022

This should work

df.pivot_table(rows=['Date'], cols=['Group'], values=['Data']).plot()

But be aware that each data point will be a 'mean' of the data points in a particular group for a given day

Upvotes: 2

Related Questions