Reputation: 6836
I have a Pandas DataFrame that includes a date column. Elements of that column are of type pandas.tslib.Timestamp.
I'd like to group the DataFrame by date, ignoring any timestamp information that is more granular than the date (i.e. grouping by date, so that all rows from Feb 23, 2011 fall into one group). I know how to express this in SQL, but I am quite new to Pandas.
This question does something very similar, but I don't understand the code, and it uses datetime objects.
From the documentation, I don't even understand how to retrieve the date from a Pandas Timestamp object. I could convert it to a datetime object, but that seems very roundabout.
As requested, the output of df.head():
                 date              show network  timed                            session_id
0 2011-12-03 02:48:52              Monk    TV38    670  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
1 2011-12-03 03:00:09          WBZ News    TV38    205  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
2 2011-12-03 03:04:04      Dateline NBC     NBC     30  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
3 2011-12-03 03:04:35             20/20     ABC     25  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
4 2011-12-03 03:04:56  College Football     FOX     55  00003DA9-01D2-E7A9-4177-203BE6A9E2BA
Upvotes: 10
Views: 16544
Reputation: 375377
You can use the DatetimeIndex method normalize (which sets each timestamp to midnight of that day):
In [11]: df['date']
Out[11]:
0 2011-12-03 02:48:52
1 2011-12-03 03:00:09
2 2011-12-03 03:04:04
3 2011-12-03 03:04:35
4 2011-12-03 03:04:56
Name: date, dtype: datetime64[ns]
In [12]: pd.DatetimeIndex(df['date']).normalize()
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-12-03 00:00:00, ..., 2011-12-03 00:00:00]
Length: 5, Freq: None, Timezone: None
You can then group by this:
g = df.groupby(pd.DatetimeIndex(df['date']).normalize())
From pandas 0.15 onward you have access to the dt accessor, so you can write this as:
g = df.groupby(df['date'].dt.normalize())
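As a minimal sketch of how the dt.normalize() grouping behaves end to end: the frame below is modeled on a few rows of the question's df.head(), with one extra next-day row that is hypothetical and added only so that two groups appear. Summing timed is just one possible aggregation, chosen for illustration.

```python
import pandas as pd

# Sample rows modeled on the question's df.head(); the last row is a
# hypothetical addition (not in the question) so that two days are present.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-03 03:04:04",
        "2011-12-04 10:15:00",
    ]),
    "network": ["TV38", "TV38", "NBC", "ABC"],
    "timed": [670, 205, 30, 25],
})

# normalize() keeps the datetime64 dtype but sets every time to 00:00:00,
# so all rows from the same calendar day share one key.
day = df["date"].dt.normalize()

# Group on the normalized Series and sum the numeric column per day.
timed_per_day = df.groupby(day)["timed"].sum()
print(timed_per_day)
```

Because the keys stay Timestamps, the result keeps a datetime index, which is convenient if you want to resample or plot it later.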
Upvotes: 21
Reputation: 73
It's not clear whether you are trying to group by and aggregate (as in SQL) or to create an index with a date instead of a timestamp.
If you're trying to group by and aggregate, you could do it this way:
df.groupby(df.set_index('date').index.date).mean()
Time series indexes have datetime properties like day, date, etc. This will aggregate the timed column, since it's the only numeric column.
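A variant of the same idea, sketched under the assumption of a pandas version that has the dt accessor: group on df['date'].dt.date and select timed explicitly. Selecting the column is a deliberate choice here, since recent pandas releases raise an error when asked for the mean of non-numeric columns rather than silently dropping them. The sample data is modeled on the question, and the next-day row is hypothetical.

```python
import datetime

import pandas as pd

# Rows modeled on the question's df.head(); the last row is a hypothetical
# addition (not in the question) so that a second date shows up.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-03 03:04:04",
        "2011-12-03 03:04:35",
        "2011-12-04 09:00:00",
    ]),
    "show": ["Monk", "WBZ News", "Dateline NBC", "20/20", "College Football"],
    "timed": [670, 205, 30, 25, 55],
})

# .dt.date yields plain datetime.date objects, one per row, so the group
# keys carry no time-of-day information at all.
daily_mean = df.groupby(df["date"].dt.date)["timed"].mean()
print(daily_mean)
```

Unlike normalize(), this produces datetime.date keys with object dtype, so it suits a final summary better than further time series work.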
If you're trying to create an index with a date level, you could do something like:
import datetime
df.set_index(['date', df.date.apply(lambda x: datetime.datetime.date(x))], inplace=True)
df.index.names = ['timestamp', 'daydate']
This will give you a MultiIndex with the timestamp and a date. If you don't want the change to be permanent, drop the inplace=True argument; set_index then returns a new DataFrame and leaves df as it was.
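A minimal sketch of the non-inplace variant, assuming a pandas version with the dt accessor (used here in place of the apply/lambda above). The name indexed is hypothetical, and the sample rows are modeled on the question.

```python
import pandas as pd

# A few rows modeled on the question's df.head().
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2011-12-03 02:48:52",
        "2011-12-03 03:00:09",
        "2011-12-03 03:04:04",
    ]),
    "network": ["TV38", "TV38", "NBC"],
    "timed": [670, 205, 30],
})

# Without inplace=True, set_index returns a new DataFrame; df keeps its
# original integer index and its 'date' column.
indexed = df.set_index(["date", df["date"].dt.date])
indexed.index.names = ["timestamp", "daydate"]

# The new level can be used directly as a grouping key.
print(indexed.groupby(level="daydate")["timed"].sum())
```

Keeping the original df untouched makes it easier to try different index layouts without having to call reset_index in between.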
Upvotes: 6