Reputation: 567
I have a time-series that contains multiple values at each datetime. Each value also has an associated datetime at which it was loaded, its 'loadtime', like so:
import datetime as dt
import numpy as np
import pandas as pd
# time-series index
t = pd.date_range('09/01/2017', '09/02/2017', freq='1H')
t = t.repeat(3)
n = len(t)
# data values
y = np.full((n), 0.0)
y = y.reshape(n//3, 3)
y[:, 1] = 1.0
y[:, 2] = 2.0
y = y.flatten()
# load timestamp
random_range = np.arange(0, 60)
base_date = np.datetime64('2017-10-01 12:00')
loadtimes = [base_date + np.random.choice(random_range) for x in range(n)]
df = pd.DataFrame(index=t, data={'y': y, 'loadtime': loadtimes})
>>> df.head(12)
                               loadtime    y
2017-09-01 00:00:00 2017-10-02 01:59:00  0.0
2017-09-01 00:00:00 2017-10-02 09:23:00  1.0
2017-09-01 00:00:00 2017-10-02 03:35:00  2.0
2017-09-01 01:00:00 2017-10-01 17:26:00  0.0
2017-09-01 01:00:00 2017-10-01 16:44:00  1.0
2017-09-01 01:00:00 2017-10-02 12:50:00  2.0
2017-09-01 02:00:00 2017-10-02 11:30:00  0.0
2017-09-01 02:00:00 2017-10-02 11:17:00  1.0
2017-09-01 02:00:00 2017-10-01 20:23:00  2.0
2017-09-01 03:00:00 2017-10-02 15:27:00  0.0
2017-09-01 03:00:00 2017-10-02 18:08:00  1.0
2017-09-01 03:00:00 2017-10-01 16:06:00  2.0
So far, I have come up with this solution, which iterates over all unique index values. It could get expensive as the time-series grows in length (and in the number of values per datetime), and it looks like a bit of a hack rather than clean code:
new_index = df.index.unique()
df_new = pd.DataFrame(index=new_index, columns=['y'])
# cycle through unique indices to find max loadtime
dfg = df.groupby(df.index)
for i, dfg_i in dfg:
    max_index = dfg_i['loadtime'] == dfg_i['loadtime'].max()
    if i in df_new.index:
        df_new.loc[i, 'y'] = dfg_i.loc[max_index, 'y'].values[0] # WHY IS THIS A LIST?
>>> df_new.head()
y
2017-09-01 00:00:00 1
2017-09-01 01:00:00 2
2017-09-01 02:00:00 0
2017-09-01 03:00:00 1
2017-09-01 04:00:00 1
How can I build a time-series that keeps, for each unique index, the value with the latest 'loadtime'? Is there a more pythonic solution?
Upvotes: 1
Views: 270
Reputation: 863741
First create a column from the DatetimeIndex and then call set_index on the y column. Then use groupby with idxmax, which returns the index label (here the y column value) of the maximum loadtime in each group:
print (df.rename_axis('dat')
.reset_index()
.set_index('y')
.groupby('dat')['loadtime']
.idxmax()
.to_frame('y'))
y
dat
2017-09-01 00:00:00 1.0
2017-09-01 01:00:00 2.0
2017-09-01 02:00:00 0.0
2017-09-01 03:00:00 1.0
Detail:
print (df.rename_axis('dat')
.reset_index()
.set_index('y'))
dat loadtime
y
0.0 2017-09-01 00:00:00 2017-10-02 01:59:00
1.0 2017-09-01 00:00:00 2017-10-02 09:23:00
2.0 2017-09-01 00:00:00 2017-10-02 03:35:00
0.0 2017-09-01 01:00:00 2017-10-01 17:26:00
1.0 2017-09-01 01:00:00 2017-10-01 16:44:00
2.0 2017-09-01 01:00:00 2017-10-02 12:50:00
0.0 2017-09-01 02:00:00 2017-10-02 11:30:00
1.0 2017-09-01 02:00:00 2017-10-02 11:17:00
2.0 2017-09-01 02:00:00 2017-10-01 20:23:00
0.0 2017-09-01 03:00:00 2017-10-02 15:27:00
1.0 2017-09-01 03:00:00 2017-10-02 18:08:00
2.0 2017-09-01 03:00:00 2017-10-01 16:06:00
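To make the idxmax step concrete, here is a minimal sketch on a small hand-written frame. The name `small` and its six rows are assumptions for illustration (the loadtimes are copied from the first two groups shown above), not the random data from the question:

```python
import pandas as pd

# Hypothetical data: two timestamps, three values each.
idx = pd.to_datetime(['2017-09-01 00:00'] * 3 + ['2017-09-01 01:00'] * 3)
small = pd.DataFrame(
    {'y': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
     'loadtime': pd.to_datetime([
         '2017-10-02 01:59', '2017-10-02 09:23', '2017-10-02 03:35',
         '2017-10-01 17:26', '2017-10-01 16:44', '2017-10-02 12:50'])},
    index=idx)

# Move the datetime index into a column named 'dat', index by 'y',
# then ask each 'dat' group for the label (a y value) of its max loadtime.
latest = (small.rename_axis('dat')
               .reset_index()
               .set_index('y')
               .groupby('dat')['loadtime']
               .idxmax()
               .to_frame('y'))
print(latest)
```

Because y is the index when idxmax runs, the label it returns for each group is the y value itself, so no second lookup is needed.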
Timings:
t = pd.date_range('01/01/2017', '12/25/2017', freq='1H')
#len(df)
#[25779 rows x 2 columns]
In [225]: %timeit (df.rename_axis('dat').reset_index().set_index('y').groupby('dat')['loadtime'].idxmax().to_frame('y'))
1 loop, best of 3: 870 ms per loop
In [226]: %timeit df.groupby(level=0).apply(lambda x : x.set_index('y').idxmax()).rename(columns={'loadtime':'y'})
1 loop, best of 3: 4.96 s per loop
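A different technique that may be worth timing on your own data (it is not one of the two approaches measured above, and I have not benchmarked it here) is to sort by loadtime and keep the last y per index label. A minimal sketch, again on made-up data named `small`:

```python
import pandas as pd

# Hypothetical data: two timestamps, three values each.
idx = pd.to_datetime(['2017-09-01 00:00'] * 3 + ['2017-09-01 01:00'] * 3)
small = pd.DataFrame(
    {'y': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
     'loadtime': pd.to_datetime([
         '2017-10-02 01:59', '2017-10-02 09:23', '2017-10-02 03:35',
         '2017-10-01 17:26', '2017-10-01 16:44', '2017-10-02 12:50'])},
    index=idx)

# After sorting, the most recent load is the last row of each group,
# so taking the last 'y' per index label gives the latest value.
latest_sorted = (small.sort_values('loadtime')
                      .groupby(level=0)['y']
                      .last()
                      .to_frame())
print(latest_sorted)
```

One difference to keep in mind: if two rows in a group share the same maximum loadtime, idxmax reports the first of them, while this variant keeps whichever one ends up last after sorting.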
Upvotes: 2
Reputation: 30605
You can use groupby on level 0 together with apply, i.e.:
ndf = df.groupby(level=0).apply(lambda x : x.set_index('y').idxmax()).rename(columns={'loadtime':'y'})
Output of ndf.head():
                       y
2017-09-01 00:00:00  1.0
2017-09-01 01:00:00  1.0
2017-09-01 02:00:00  2.0
2017-09-01 03:00:00  1.0
2017-09-01 04:00:00  1.0
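On the `# WHY IS THIS A LIST?` comment in the question: selecting with a boolean mask in `.loc` keeps every row where the mask is True, so the result is a Series even when only one row matches, and `.values` is therefore an array. A minimal sketch, assuming a single made-up group called `grp`, that shows this and one way to get a scalar directly:

```python
import pandas as pd

# Hypothetical single group: three rows sharing one index label.
grp = pd.DataFrame(
    {'y': [0.0, 1.0, 2.0],
     'loadtime': pd.to_datetime(['2017-10-02 01:59',
                                 '2017-10-02 09:23',
                                 '2017-10-02 03:35'])},
    index=pd.to_datetime(['2017-09-01 00:00'] * 3))

# A boolean mask keeps every matching row, so the result is a Series.
mask = grp['loadtime'] == grp['loadtime'].max()
selected = grp.loc[mask, 'y']
print(type(selected).__name__, len(selected))

# A positional argmax is a single integer, so iloc returns a scalar.
pos = grp['loadtime'].to_numpy().argmax()
value = grp['y'].iloc[pos]
print(value)
```

The positional lookup is used here because all rows in a group share the same index label, so a label-based lookup could not single out one row.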
Upvotes: 2