vicemagui
vicemagui

Reputation: 151

Down-sampling specific period on dataframe using Pandas

I have a long time serie that starts in 1963 and ends in 2013. However, from 1963 til 2007 it has an hourly sampling period while after 2007's sampling rate changes to 5 minutes. Is it possible to resample data just after 2007 in a way that the entire time serie has hourly data sampling? Data slice below.

yr, m, d, h, m, s, sl
2007, 11, 30, 19, 0, 0, 2180
2007, 11, 30, 20, 0, 0, 2310
2007, 11, 30, 21, 0, 0, 2400
2007, 11, 30, 22, 0, 0, 2400
2007, 11, 30, 23, 0, 0, 2270
2008, 1, 1, 0, 0, 0, 2210
2008, 1, 1, 0, 5, 0, 2210
2008, 1, 1, 0, 10, 0, 2210
2008, 1, 1, 0, 15, 0, 2200
2008, 1, 1, 0, 20, 0, 2200
2008, 1, 1, 0, 25, 0, 2200
2008, 1, 1, 0, 30, 0, 2200
2008, 1, 1, 0, 35, 0, 2200
2008, 1, 1, 0, 40, 0, 2200
2008, 1, 1, 0, 45, 0, 2200
2008, 1, 1, 0, 50, 0, 2200
2008, 1, 1, 0, 55, 0, 2200
2008, 1, 1, 1, 0, 0, 2190
2008, 1, 1, 1, 5, 0, 2190  

Thanks!

Upvotes: 1

Views: 1522

Answers (3)

piRSquared
piRSquared

Reputation: 294488

Give your dataframe proper column names

df.columns = 'year month day hour minute second sl'.split()

Solution

df.groupby(['year', 'month', 'day', 'hour'], as_index=False).first()

   year  month  day  hour  minute  second    sl
0  2007     11   30    19       0       0  2180
1  2007     11   30    20       0       0  2310
2  2007     11   30    21       0       0  2400
3  2007     11   30    22       0       0  2400
4  2007     11   30    23       0       0  2270
5  2008      1    1     0       0       0  2210
6  2008      1    1     1       0       0  2190

Option 2
Here is an option that builds off of the column renaming. We'll use pd.to_datetime to cleverly get at our dates, then use resample. However, you have time gaps and will have to address nulls and re-cast dtypes.

df.set_index(
    pd.to_datetime(df.drop('sl', 1))
).resample('H').first().dropna().astype(df.dtypes)

                     year  month  day  hour  minute  second    sl
2007-11-30 19:00:00  2007     11   30    19       0       0  2180
2007-11-30 20:00:00  2007     11   30    20       0       0  2310
2007-11-30 21:00:00  2007     11   30    21       0       0  2400
2007-11-30 22:00:00  2007     11   30    22       0       0  2400
2007-11-30 23:00:00  2007     11   30    23       0       0  2270
2008-01-01 00:00:00  2008      1    1     0       0       0  2210
2008-01-01 01:00:00  2008      1    1     1       0       0  2190

Upvotes: 2

ARASH
ARASH

Reputation: 428

You'd better first append a datetime column to your dataframe:
df['datetime'] = pd.to_datetime(df[['yr', 'mnth', 'd', 'h', 'm', 's']])

But before that you should rename the month column:
df.rename(columns={ df.columns[1]: "mnth" })

Then you set a datetime column as dataframe index:
data.set_index('datetime', inplace=True)

Now you can apply resample method on your dataframe on by prefereed sampling rate:
df.resample('60T', on='datatime').mean()

Here I used mean to aggregate. You can use other method based on your need. See Pandas document as a ref.

Upvotes: 2

Egal
Egal

Reputation: 1474

Rename the minute column for convenience:

df.columns = ['yr', 'm', 'd', 'h', 'M', 's', 'sl']

Create a datetime column:

from datetime import datetime as dt
df['dt'] = df.apply(axis=1, func=lambda x: dt(x.yr, x.m, x.d, x.h, x.M, x.s))

Resample:

For pandas < 0.19:

df = df.set_index('dt').resample('60T').reset_index('dt')

For pandas >= 0.19:

df = df.resample('60T', on='dt')

Upvotes: 2

Related Questions