Reputation: 273
I have just discovered the power of Pandas and I love it, but I can't figure out this problem:
I have a DataFrame; df.head() shows:
lon lat h filename time
0 19.961216 80.617627 -0.077165 60048 2002-05-15 12:59:31.717467
1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
4 19.570708 80.588183 0.162943 60048 2002-05-15 12:59:32.888467
I would like to group my data into nine-day intervals:
gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))
The first group:
2002-05-15 12:59:31.717467 lon lat h filename time
0 19.961216 80.617627 -0.077165 60048 2002-05-15 12:59:31.717467
1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
...
Next group:
2002-05-24 12:59:31.717467 lon lat h filename time
815 18.309498 80.457024 0.187387 60309 2002-05-24 16:35:39.553563
816 18.291458 80.458514 0.061446 60309 2002-05-24 16:35:39.610563
817 18.273408 80.460014 0.129255 60309 2002-05-24 16:35:39.667563
818 18.255358 80.461504 0.046761 60309 2002-05-24 16:35:39.724563
...
So the data are grouped into nine-day intervals counting from the first timestamp (12:59:31.717467), and not from the beginning of the day as I would like.
When grouping by one day:
gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))
gives me:
2002-05-15 00:00:00 lon lat h filename time
0 19.961216 80.617627 -0.077165 60048 2002-05-15 12:59:31.717467
1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
...
I can just loop over the days until I get a nine-day interval, but I think it could be done more elegantly. I am looking for a Grouper freq
option equivalent to YS (start of year) just for days; a way of setting the start time (maybe via the Grouper option convention : {'start', 'end', 'e', 's'}); or something else?
I am running Python 3.5.2 and pandas version 0.19.0.
Upvotes: 11
Views: 11394
Reputation: 11
I know this is a late entry, but according to the pandas.Grouper documentation, you can set the origin option of the pandas.Grouper object to "start_day", as in:
for _, grp in df.groupby(pd.Grouper(freq="9D", origin="start_day")):
    print(grp)
This option was added in pandas 1.1.0.
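A minimal, self-contained sketch of that behavior, using synthetic timestamps shaped like the question's data and a Grouper keyed on the time column rather than the index:

```python
import pandas as pd

# Synthetic timestamps that start mid-day, like the question's data
times = pd.to_datetime([
    "2002-05-15 12:59:31", "2002-05-16 01:00:00",
    "2002-05-24 16:35:39", "2002-05-25 00:30:00",
])
df = pd.DataFrame({"h": [0.1, 0.2, 0.3, 0.4], "time": times})

# origin="start_day" anchors the bins at midnight of the first day, so the
# first 9-day bin starts at 2002-05-15 00:00:00 rather than at 12:59:31
gb = df.groupby(pd.Grouper(key="time", freq="9D", origin="start_day"))
bin_starts = [name for name, _ in gb]
```

The first group key comes out as 2002-05-15 00:00:00 and the second as 2002-05-24 00:00:00, which is exactly the midnight-anchored behavior the question asks for.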
Upvotes: 1
Reputation: 29711
Dropping first time row:
Your best bet would be to normalize the first row of the time column so that its time is reset to 00:00:00 (midnight), and group according to the 9D interval:
df.loc[0, 'time'] = df['time'].iloc[0].normalize()
for _, grp in df.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print(grp)
# lon lat h filename time
# 0 19.961216 80.617627 -0.077165 60048 2002-05-15 00:00:00.000000
# 1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
# 2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
# 3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
# 4 19.570708 80.588183 0.162943 60048 2002-05-15 12:59:32.888467
# ......................................................................
This restores the time in the other rows and so you wouldn't lose that information.
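As a quick check of the idea, here is a self-contained sketch on synthetic timestamps; pd.Grouper stands in for pd.TimeGrouper, which newer pandas versions removed:

```python
import pandas as pd

# Two timestamps just over nine days apart, both starting mid-day
times = pd.to_datetime(["2002-05-15 12:59:31", "2002-05-24 16:35:39"])
df = pd.DataFrame({"h": [0.1, 0.2], "time": times})

# Normalize only the first row: its time component resets to midnight,
# which anchors the 9-day bins at 2002-05-15 00:00:00
df.loc[0, "time"] = df["time"].iloc[0].normalize()
starts = [name for name, _ in df.groupby(pd.Grouper(key="time", freq="9D"))]
```

The bins then start at 2002-05-15 and 2002-05-24, midnight-aligned.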
Keeping first time row:
If you want to keep the first time row unchanged and only start the grouping from midnight, you could do:
df_t_shift = df.shift()  # Shift all rows down by one; row 0 becomes NaN
df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()
# Concat the last row of df with the shifted frame to account for the lost row
df_t_shift = df_t_shift.append(df.iloc[-1], ignore_index=True)
for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print(grp)
# lon lat h filename time
# 0 NaN NaN NaN NaN 2002-05-15 00:00:00.000000
# 1 19.961216 80.617627 -0.077165 60048.0 2002-05-15 12:59:31.717467
# 2 19.923916 80.614847 -0.018689 60048.0 2002-05-15 12:59:31.831467
# 3 19.849396 80.609257 -0.089205 60048.0 2002-05-15 12:59:32.059467
# 4 19.830776 80.607857 0.076485 60048.0 2002-05-15 12:59:32.116467
# 5 19.570708 80.588183 0.162943 60048.0 2002-05-15 12:59:32.888467
Upvotes: 3
Reputation: 15545
If you truncate the datetimes to midnight of the given day, the grouping will work as expected (starting at the beginning of the day). My first thought was to convert to plain dates, e.g.
df['date'] = df['time'].apply(lambda x: x.date())
However, you cannot use TimeGrouper unless the grouping column holds datetime values.
You instead have two options. Either truncate the datetimes to midnight directly, as follows:
df['date'] = df['time'].apply(lambda x: x.replace(hour=0, minute=0, second=0, microsecond=0))
Alternatively, you can first generate date values, then convert them back to datetimes using the pd.to_datetime() function:
df['date'] = df['time'].apply(lambda x: x.date())
df['date'] = pd.to_datetime(df['date'])
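On current pandas, the same truncation can be done in one vectorized call with Series.dt.normalize(); a small sketch on synthetic data:

```python
import pandas as pd

# Synthetic timestamps shaped like the question's time column
times = pd.to_datetime(["2002-05-15 12:59:31.717467",
                        "2002-05-24 16:35:39.553563"])
df = pd.DataFrame({"time": times})

# dt.normalize() resets hour/minute/second/microsecond to zero in one call,
# while keeping the column as datetime64 (no detour through date objects)
df["date"] = df["time"].dt.normalize()
```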
Upvotes: 1
Reputation: 17122
Completing @mfitzp's answer, you could do this:
df['dateonly'] = df['time'].apply(lambda x: x.date())
The only problem with that is that df['dateonly']
would not hold datetime values, so you need to convert it first:
df['dateonly'] = pd.to_datetime(df['dateonly'])
Now you can group on it:
gb = df.groupby(pd.TimeGrouper(key='dateonly', freq='9D'))
And for extra information: convention is used with a PeriodIndex, not a DatetimeIndex.
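Putting this answer together as a runnable sketch on synthetic data (pd.Grouper stands in for pd.TimeGrouper, which later pandas versions removed):

```python
import pandas as pd

# Two synthetic rows nine-plus days apart, timestamps starting mid-day
times = pd.to_datetime(["2002-05-15 12:59:31", "2002-05-24 16:35:39"])
df = pd.DataFrame({"h": [0.1, 0.2], "time": times})

df["dateonly"] = df["time"].apply(lambda x: x.date())  # python date objects
df["dateonly"] = pd.to_datetime(df["dateonly"])        # back to datetime64
gb = df.groupby(pd.Grouper(key="dateonly", freq="9D"))
group_sizes = {name: len(grp) for name, grp in gb}
```

Each row lands in its own midnight-anchored nine-day bin, starting 2002-05-15 and 2002-05-24.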
Upvotes: 1