EngStan

Reputation: 383

Pandas dataframe: group across years

In Pandas, is there a groupby operation to group values across multiple years, when the rest of the timestamp is the same?

For example, 12:00:00 01/01/2000, 12:00:00 01/01/2001 and 12:00:00 01/01/2002 would form one group, while 15:00:00 01/01/2000, 15:00:00 01/01/2001 and 15:00:00 01/01/2002 would form another, and so on.

I can sort of achieve this with:

group = timeseries.groupby(by=[timeseries.index.minute, timeseries.index.hour, timeseries.index.day, timeseries.index.month])

but it is really ugly and not flexible with respect to the input time format. What I really want is a way of excluding the year from the groupby while keeping everything else.

Upvotes: 2

Views: 1667

Answers (4)

I found the .day_of_year attribute of the pandas DatetimeIndex while working on a similar problem.

You can then group by the hour of the year:

timeseries.groupby((timeseries.index.day_of_year - 1) * 24 + timeseries.index.hour)

which also works with leap years.

To get back a date format, you may choose any base year provided it is a leap year, in case your time series includes a February 29th:

newseries.index.map(lambda t: pd.to_datetime('2000-01-01') + pd.Timedelta(hours=t))
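
For reference, a minimal end-to-end sketch of this approach -- the hourly dummy data, the mean aggregation and the 2000-01-01 base date are illustrative assumptions, not part of the original snippets:

import numpy as np
import pandas as pd

# Hourly data spanning several years (placeholder values).
idx = pd.date_range('2000-01-01', '2002-12-31 23:00', freq='H')
timeseries = pd.Series(np.random.rand(len(idx)), index=idx)

# Zero-based hour-of-year key: 00:00 on 1 January maps to 0.
hour_of_year = (timeseries.index.day_of_year - 1) * 24 + timeseries.index.hour

# Average across years for each hour of the year.
newseries = timeseries.groupby(hour_of_year).mean()

# Map the integer keys back to timestamps in a leap year (2000).
newseries.index = newseries.index.map(
    lambda t: pd.to_datetime('2000-01-01') + pd.Timedelta(hours=t))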

Upvotes: 0

shoyer

Reputation: 9603

You can subtract off a timestamp for the start of each year to create a TimedeltaIndex. One vectorized way to do this is to convert the index to a lower-resolution datetime64 dtype with NumPy, e.g.,

>>> idx = pd.date_range('2001-01-01', periods=1000)

>>> idx
DatetimeIndex(['2001-01-01', '2001-01-02', '2001-01-03', '2001-01-04',
               '2001-01-05', '2001-01-06', '2001-01-07', '2001-01-08',
               '2001-01-09', '2001-01-10',
               ...
               '2003-09-18', '2003-09-19', '2003-09-20', '2003-09-21',
               '2003-09-22', '2003-09-23', '2003-09-24', '2003-09-25',
               '2003-09-26', '2003-09-27'],
              dtype='datetime64[ns]', length=1000, freq='D')

# note that pandas only handles datetime64[ns], so we convert back
# using pd.to_datetime

>>> deltas = idx - pd.to_datetime(idx.values.astype('datetime64[Y]'))

>>> deltas
TimedeltaIndex([  '0 days',   '1 days',   '2 days',   '3 days',   '4 days',
                  '5 days',   '6 days',   '7 days',   '8 days',   '9 days',
                ...
                '260 days', '261 days', '262 days', '263 days', '264 days',
                '265 days', '266 days', '267 days', '268 days', '269 days'],
               dtype='timedelta64[ns]', length=1000, freq=None)

>>> group = pd.Series(np.arange(1000), idx).groupby(deltas)

One thing to watch out for is leap years -- you will end up with a few dangling deltas of 365 days that only occur in leap years.
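
As a usage sketch -- the daily dummy data, the mean aggregation and the leap-year anchor are illustrative assumptions, not part of the original answer -- the groups can be aggregated and the keys mapped back onto a reference year:

import numpy as np
import pandas as pd

# Daily dummy data over several years (placeholder values).
idx = pd.date_range('2001-01-01', periods=1000)
s = pd.Series(np.arange(1000), idx)

# Offset of each timestamp from the start of its own year.
deltas = idx - pd.to_datetime(idx.values.astype('datetime64[Y]'))

# Average across years for each within-year offset.
result = s.groupby(deltas).mean()

# Anchor the offsets to a leap year (2000) to recover dates.
result.index = pd.to_datetime('2000-01-01') + result.index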

Upvotes: 2

Vaishali

Reputation: 38415

Something like this?

t = pd.Series(['12:00:00 01/01/2000', '12:00:00 01/01/2001', '12:00:00 01/01/2002',
               '15:00:00 01/01/2000', '15:00:00 01/01/2001', '15:00:00 01/01/2002'])
df = pd.DataFrame(t.str.split(' ', n=1).tolist(), columns=['Time', 'Date'])
df.groupby(['Time', 'Date']).required_function()

This gives you the data grouped on Time and then Date; replace required_function() with whatever aggregation you need.
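
A minimal runnable sketch, substituting .size() for the required_function placeholder (the aggregation choice here is only an illustrative assumption):

import pandas as pd

t = pd.Series(['12:00:00 01/01/2000', '12:00:00 01/01/2001', '12:00:00 01/01/2002',
               '15:00:00 01/01/2000', '15:00:00 01/01/2001', '15:00:00 01/01/2002'])
df = pd.DataFrame(t.str.split(' ', n=1).tolist(), columns=['Time', 'Date'])

# Counting rows per (Time, Date) pair; any aggregation could stand in here.
print(df.groupby(['Time', 'Date']).size())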

Upvotes: 0

jezrael

Reputation: 862641

You can set a constant year on the index and then group by the index:

timeseries.index = timeseries.index.map(lambda t: t.replace(year=2010))
print (timeseries)
group = timeseries.groupby(level=0).sum()
print (group)
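
For completeness, a minimal runnable sketch of this approach -- the hourly dummy data is an assumption, and 2000 is used as the constant year instead of 2010 because a leap year avoids errors on February 29th:

import numpy as np
import pandas as pd

# Hourly dummy data spanning three years (placeholder values).
idx = pd.date_range('2000-01-01', '2002-12-31 23:00', freq='H')
timeseries = pd.Series(np.random.rand(len(idx)), index=idx)

# Collapse every timestamp onto the same (leap) year, then group by the index.
timeseries.index = timeseries.index.map(lambda t: t.replace(year=2000))
group = timeseries.groupby(level=0).sum()
print(group)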

Upvotes: 1
