user1643523
user1643523

Reputation: 273

Grouping Pandas DataFrame by n days starting in the begining of the day

I have just discovered the power of Pandas and I love it, but I can't figure out this problem:

I have a DataFrame df.head():

   lon   lat  h  filename                  time
0  19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1  19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2  19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3  19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
4  19.570708  80.588183     0.162943     60048 2002-05-15 12:59:32.888467

I would like to group my data into nine day intervals

gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))

The first group:

2002-05-15 12:59:31.717467       lon   lat  h filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

Next group:

2002-05-24 12:59:31.717467        lon   lat  height  filename                  time
815   18.309498  80.457024     0.187387     60309 2002-05-24 16:35:39.553563
816   18.291458  80.458514     0.061446     60309 2002-05-24 16:35:39.610563
817   18.273408  80.460014     0.129255     60309 2002-05-24 16:35:39.667563
818   18.255358  80.461504     0.046761     60309 2002-05-24 16:35:39.724563
...

So the data are grouped in nine days counting from the first time ( 12:59:31.717467), and not from the beginning of the day as I would like.

When grouping by one day:

gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))

gives me:

2002-05-15 00:00:00       lon   lat  h  filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

I can just loop over the days until I get a nine day interval, but I think it could be done smarter, I am looking for a Grouper freq option equivalent to YS (start of year) just for days, a way of setting the start time (maybe by the Grouper option convention : {‘start’, ‘end’, ‘e’, ‘s’}), or???

I am running Python 3.5.2 and Pandas is in version: 0.19.0

Upvotes: 11

Views: 11394

Answers (4)

schardong
schardong

Reputation: 11

I know this is a late entry. But according to the pandas.Grouper documentation, you can set the origin option of the pandas.Grouper object to "start_day", as in:

for _, grp in df.groupby(pd.Grouper(freq="9D", origin="start_day")):
    print(grp)

This option was added in the 1.1.0 version.

Upvotes: 1

Nickil Maveli
Nickil Maveli

Reputation: 29711

Dropping first time row:

Your best bet would be to normalize the first row of the datetime column so that the time is reset to 00:00:00(midnight) and group according to the 9D interval:

df.loc[0, 'time'] = df['time'].iloc[0].normalize()
for _, grp in df.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print (grp)

#          lon        lat         h  filename                       time
# 0  19.961216  80.617627 -0.077165     60048 2002-05-15 00:00:00.000000
# 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
# 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
# 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
# 4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467
# ......................................................................

This restores the time in the other rows and so you wouldn't lose that information.


Keeping first time row:

If you want to keep the first time row as it is and not make any changes to it, but only want to start grouping from midnight onwards, you could do:

df_t_shift = df.shift()    # Shift one level down
df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()
# Concat last row of df with the shifted one to account for the loss of row
df_t_shift = df_t_shift.append(df.iloc[-1], ignore_index=True)  

for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print (grp)

#          lon        lat         h  filename                       time
# 0        NaN        NaN       NaN       NaN 2002-05-15 00:00:00.000000
# 1  19.961216  80.617627 -0.077165   60048.0 2002-05-15 12:59:31.717467
# 2  19.923916  80.614847 -0.018689   60048.0 2002-05-15 12:59:31.831467
# 3  19.849396  80.609257 -0.089205   60048.0 2002-05-15 12:59:32.059467
# 4  19.830776  80.607857  0.076485   60048.0 2002-05-15 12:59:32.116467
# 5  19.570708  80.588183  0.162943   60048.0 2002-05-15 12:59:32.888467

Upvotes: 3

mfitzp
mfitzp

Reputation: 15545

If you truncate the datetimes to midnight of the given day, the grouping will work as expected (starting at the beginning of the day). I expected it to work by converting to datetimes, e.g

df['date'] = df['time'].apply(lambda x:x.date())

However, you can not use TimeGrouper unless the index is a datetime. You instead have two options, either truncate the datetimes to midnight directly as follows:

df['date'] = df['time'].apply(lambda x:x.replace(hour=0, minute=0, second=0, microsecond=0)))

Alternatively, you can first generate date values, then convert them back to datetimes, using the pd.to_datetime() function:

df['date'] = df['time'].apply(lambda x: x.date() )
df['date'] = pd.to_datetime(df['date'])

Upvotes: 1

Steven G
Steven G

Reputation: 17122

completing @mfitzp answer you could do this:

df['dateonly'] = df['time'].apply(lambda x: x.date())

only problem with that is df['dateonly'] would not be a DatetimeIndex

you need to convert it first:

df['dateonly'] = pd.to_datetime(df['dateonly'])

now you can group on it

gb = df.groupby(pd.TimeGrouper(key='dateonly', freq='9D'))

and for extra information conventionis used with PeriodIndexnot DatetimeIndex

Upvotes: 1

Related Questions