Reputation: 1675
I have a pandas.DataFrame containing start and end columns, plus a couple of additional columns. I would like to expand this dataframe into a time series that runs from the start value to the end value of each row, copying my other columns along. So far I have come up with the following:
import pandas as pd
import datetime as dt
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']
data_series = list()
for row in df.itertuples():
    time_range = pd.bdate_range(row.start, row.end)
    s = len(time_range)
    data_series += zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s)
columns_names = ['date', 'start', 'end', 'country', 'letter']
df = pd.DataFrame(data_series, columns=columns_names)
Starting DataFrame:
start end country letter
0 2017-04-03 2017-04-10 US a
1 2017-04-05 2017-04-12 EU b
2 2017-04-10 2017-04-17 UK c
Desired output:
date start end country letter
0 2017-04-03 2017-04-03 2017-04-10 US a
1 2017-04-04 2017-04-03 2017-04-10 US a
2 2017-04-05 2017-04-03 2017-04-10 US a
3 2017-04-06 2017-04-03 2017-04-10 US a
4 2017-04-07 2017-04-03 2017-04-10 US a
5 2017-04-10 2017-04-03 2017-04-10 US a
6 2017-04-05 2017-04-05 2017-04-12 EU b
7 2017-04-06 2017-04-05 2017-04-12 EU b
8 2017-04-07 2017-04-05 2017-04-12 EU b
9 2017-04-10 2017-04-05 2017-04-12 EU b
10 2017-04-11 2017-04-05 2017-04-12 EU b
11 2017-04-12 2017-04-05 2017-04-12 EU b
12 2017-04-10 2017-04-10 2017-04-17 UK c
13 2017-04-11 2017-04-10 2017-04-17 UK c
14 2017-04-12 2017-04-10 2017-04-17 UK c
15 2017-04-13 2017-04-10 2017-04-17 UK c
16 2017-04-14 2017-04-10 2017-04-17 UK c
17 2017-04-17 2017-04-10 2017-04-17 UK c
The problem with my solution is that when I apply it to a much bigger dataframe (mostly in terms of rows), it is not fast enough. Does anybody have ideas for how I could improve it? I am also open to solutions in numpy.
Upvotes: 9
Views: 2714
Reputation: 119
Since the goal here is speed, we should vectorize every step: a single row-wise Python loop can slow the code down by an order of magnitude or more. A vectorized solution is provided below:
import numpy as np
import pandas as pd

cols = list(df.columns)

# give every row a unique id so expanded rows can be traced back to it
df['data_id'] = np.arange(0, len(df))
data_id = df['data_id']
start = df['start']
end = df['end']

# inclusive number of calendar days each row spans
diff = ((end - start) / np.timedelta64(1, 'D')).astype('int') + 1

# repeat each row id once per day in its range
repeated_id = np.repeat(data_id, diff)
time_df = pd.DataFrame(data={'data_id': repeated_id})
time_df = pd.merge(left=time_df, right=df[['data_id', 'start']], on=['data_id'])

# running counter minus each group's first counter value = day offset within the row
time_df['day_id'] = np.arange(0, len(time_df))
min_day_id = time_df.groupby('data_id')['day_id'].min().reset_index().rename(columns={'day_id': 'min_day_id'})
time_df = pd.merge(left=time_df, right=min_day_id, on=['data_id'])

# add the offsets to each row's start date, then drop weekend dates
days_to_add = (time_df['day_id'] - time_df['min_day_id']) * np.timedelta64(1, 'D')
time_df['date'] = time_df['start'] + days_to_add
time_df = time_df[time_df['date'].dt.dayofweek < 5]

# attach the expanded dates back onto the original rows
df = pd.merge(left=df, right=time_df[['data_id', 'date']], on=['data_id'])
df = df[['date'] + cols]
The vectorized version works as follows: each row gets a data_id, which is repeated once per calendar day in its range; a running counter minus each group's first counter value yields per-row day offsets, which are added to start to produce the dates; weekend dates are then dropped, and the dates are merged back onto the original rows.
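To see the core trick in isolation, here is a toy sketch (all values and names here are illustrative, not from the answer's code), assuming two rows spanning 3 and 2 days:
import numpy as np

data_id = np.array([0, 1])
diff = np.array([3, 2])                 # inclusive day counts for two toy rows
repeated_id = np.repeat(data_id, diff)  # [0 0 0 1 1]
day_id = np.arange(len(repeated_id))    # [0 1 2 3 4]
# first day_id within each group (what the groupby-min computes): [0 0 0 3 3]
min_day_id = np.repeat(np.concatenate(([0], np.cumsum(diff)[:-1])), diff)
offsets = (day_id - min_day_id) * np.timedelta64(1, 'D')
print(offsets)                          # [0 1 2 0 1] days, added to each row's start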
Using jezrael's timing setup below, the original method took 1.15 s on my machine, while the vectorized version took 56.9 ms, a roughly 20-fold speedup.
Upvotes: 0
Reputation: 863341
Timings, plus 3 more solutions:
#original solution
In [163]: %%timeit
...: data_series = list()
    ...: for row in df.itertuples():
    ...:     time_range = pd.bdate_range(row.start, row.end)
    ...:     s = len(time_range)
    ...:     data_series += zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s)
...:
...: columns_names = ['date', 'start', 'end', 'country', 'letter']
...: df3 = pd.DataFrame(data_series, columns=columns_names)
...:
1 loop, best of 3: 634 ms per loop
#Stephen Rauch's solution, slightly changed to avoid warnings
In [164]: %%timeit
...: dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
...: for row in df[['start', 'end']].iterrows()]
...: deltas = [len(x) for x in dates]
...: dates = pd.Series(pd.concat(dates).values, name='date')
...: df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
...: df2['start'] = pd.to_datetime(df2['start'])
...: df2['end'] = pd.to_datetime(df2['end'])
...: df2 = pd.concat([dates, df2], axis=1)
...:
1 loop, best of 3: 669 ms per loop
#maxu solution
In [165]: %%timeit
...: dates = [pd.bdate_range(r[0],r[1]).to_series() for r in df[['start','end']].values]
...: lens = [len(x) for x in dates]
...: r = pd.DataFrame(
...: {col:np.repeat(df[col].values, lens) for col in df.columns}
...: ).assign(date=np.concatenate(dates))
...:
1 loop, best of 3: 609 ms per loop
#jezrael solution1
In [166]: %%timeit
...: df1 = pd.concat([pd.Series(r.Index,
...: pd.bdate_range(r.start, r.end))
...: for r in df.itertuples()]).reset_index()
...: df1.columns = ['date','idx']
...: df2 = df1.set_index('idx').join(df).reset_index(drop=True)
...:
1 loop, best of 3: 632 ms per loop
#jezrael solution2 (improved maxu solution)
In [167]: %%timeit
...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
...: lens = [len(x) for x in dates]
...:
...: df4 = pd.DataFrame(
...: {col:np.repeat(df[col].values, lens) for col in df.columns}
...: )
...: df4.insert(0, 'date', np.concatenate(dates))
...:
1 loop, best of 3: 583 ms per loop
#jezrael solution 3
In [207]: %%timeit
...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
...: lens = [len(x) for x in dates]
...: r = np.repeat(df.index.values, lens)
...: df2 = pd.DataFrame(df.values[r], df.index[r], df.columns).reset_index(drop=True)
...: df2['start'] = pd.to_datetime(df2['start'])
...: df2['end'] = pd.to_datetime(df2['end'])
...: df2.insert(0, 'date', np.concatenate(dates))
...:
1 loop, best of 3: 609 ms per loop
Code for timings:
import pandas as pd
import datetime as dt
df = pd.DataFrame()
N = 100
#N = 1
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]*N
df['end'] = [dt.datetime(2017, 8, 10), dt.datetime(2017, 5, 12), dt.datetime(2017, 5, 17)]*N
df['country'] = ['US', 'EU', 'UK']*N
df['letter'] = ['a', 'b', 'c']*N
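As a sanity check to go with the timings, the outputs can be compared for equality. A minimal sketch, assuming each %%timeit body has also been run once as a plain cell so that df3 and df4 exist (variables created inside %%timeit do not persist), and aligning column order since dict-based construction may reorder columns:
pd.testing.assert_frame_equal(df3, df4[list(df3.columns)])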
Upvotes: 6
Reputation: 3852
For your dataframe:
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']
start end country letter
0 2017-04-03 2017-04-10 US a
1 2017-04-05 2017-04-12 EU b
2 2017-04-10 2017-04-17 UK c
First, build a new index using pd.date_range() or DatetimeIndex(), bounded by the first start and the last end values:
index = pd.date_range(df['start'].iloc[0], df['end'].iloc[-1])
# or (older pandas only; newer versions removed this DatetimeIndex constructor form)
index = pd.DatetimeIndex(start=df['start'].iloc[0], end=df['end'].iloc[-1], freq='D')
>>>
index = pd.DatetimeIndex(['2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06',
'2017-04-07', '2017-04-08', '2017-04-09', '2017-04-10',
'2017-04-11', '2017-04-12', '2017-04-13', '2017-04-14',
'2017-04-15', '2017-04-16', '2017-04-17'],
dtype='datetime64[ns]', freq='D')
Then reindex() with method='ffill', followed by reset_index() and rename():
df2 = df.set_index(['start']).reindex(index, method='ffill')
df2['Date'] = df2.index
df2 = df2.reset_index().rename(columns={'index': 'start'})
start end country letter Date
0 2017-04-03 2017-04-10 US a 2017-04-03
1 2017-04-04 2017-04-10 US a 2017-04-04
2 2017-04-05 2017-04-12 EU b 2017-04-05
3 2017-04-06 2017-04-12 EU b 2017-04-06
4 2017-04-07 2017-04-12 EU b 2017-04-07
5 2017-04-08 2017-04-12 EU b 2017-04-08
6 2017-04-09 2017-04-12 EU b 2017-04-09
7 2017-04-10 2017-04-17 UK c 2017-04-10
8 2017-04-11 2017-04-17 UK c 2017-04-11
9 2017-04-12 2017-04-17 UK c 2017-04-12
10 2017-04-13 2017-04-17 UK c 2017-04-13
11 2017-04-14 2017-04-17 UK c 2017-04-14
12 2017-04-15 2017-04-17 UK c 2017-04-15
13 2017-04-16 2017-04-17 UK c 2017-04-16
14 2017-04-17 2017-04-17 UK c 2017-04-17
# Time: 0.009 s
Old answer:
Set the index to the start column and expand using asfreq():
df2 = df.set_index(['start']).asfreq('D').fillna(method='ffill').reset_index()
>>>
start end country letter
0 2017-04-03 2017-04-10 US a
1 2017-04-04 2017-04-10 US a
2 2017-04-05 2017-04-12 EU b
3 2017-04-06 2017-04-12 EU b
4 2017-04-07 2017-04-12 EU b
5 2017-04-08 2017-04-12 EU b
6 2017-04-09 2017-04-12 EU b
7 2017-04-10 2017-04-17 UK c
asfreq documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.asfreq.html
Upvotes: 1
Reputation: 210942
Inspired by @StephenRauch's solution, I'd like to post mine (which is pretty similar):
# business-day range for each row
dates = [pd.bdate_range(r[0], r[1]).to_series() for r in df[['start','end']].values]
# how many expanded rows each original row produces
lens = [len(x) for x in dates]
# repeat every column accordingly, then append the concatenated dates
r = pd.DataFrame(
    {col: np.repeat(df[col].values, lens) for col in df.columns}
).assign(date=np.concatenate(dates))
Result:
In [259]: r
Out[259]:
country end letter start date
0 US 2017-04-10 a 2017-04-03 2017-04-03
1 US 2017-04-10 a 2017-04-03 2017-04-04
2 US 2017-04-10 a 2017-04-03 2017-04-05
3 US 2017-04-10 a 2017-04-03 2017-04-06
4 US 2017-04-10 a 2017-04-03 2017-04-07
5 US 2017-04-10 a 2017-04-03 2017-04-10
6 EU 2017-04-12 b 2017-04-05 2017-04-05
7 EU 2017-04-12 b 2017-04-05 2017-04-06
8 EU 2017-04-12 b 2017-04-05 2017-04-07
9 EU 2017-04-12 b 2017-04-05 2017-04-10
10 EU 2017-04-12 b 2017-04-05 2017-04-11
11 EU 2017-04-12 b 2017-04-05 2017-04-12
12 UK 2017-04-17 c 2017-04-10 2017-04-10
13 UK 2017-04-17 c 2017-04-10 2017-04-11
14 UK 2017-04-17 c 2017-04-10 2017-04-12
15 UK 2017-04-17 c 2017-04-10 2017-04-13
16 UK 2017-04-17 c 2017-04-10 2017-04-14
17 UK 2017-04-17 c 2017-04-10 2017-04-17
Upvotes: 8
Reputation: 49812
First we can build the dates you need, while keeping track of the number of days in each row via the list deltas:
dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')
then use np.repeat to build up the data matrix with the proper segment lengths:
df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64", "end": "datetime64"})
then insert the dates into the front of the dataframe:
df2 = pd.concat([dates, df2], axis=1)
Test Code:
import pandas as pd
import numpy as np
import datetime as dt
df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5),
dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12),
dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']
dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')
df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64", "end": "datetime64"})
df2 = pd.concat([dates, df2], axis=1)
print(df2)
Results:
date start end country letter
0 2017-04-03 2017-04-03 2017-04-10 US a
1 2017-04-04 2017-04-03 2017-04-10 US a
2 2017-04-05 2017-04-03 2017-04-10 US a
3 2017-04-06 2017-04-03 2017-04-10 US a
4 2017-04-07 2017-04-03 2017-04-10 US a
5 2017-04-10 2017-04-03 2017-04-10 US a
6 2017-04-05 2017-04-05 2017-04-12 EU b
7 2017-04-06 2017-04-05 2017-04-12 EU b
8 2017-04-07 2017-04-05 2017-04-12 EU b
9 2017-04-10 2017-04-05 2017-04-12 EU b
10 2017-04-11 2017-04-05 2017-04-12 EU b
11 2017-04-12 2017-04-05 2017-04-12 EU b
12 2017-04-10 2017-04-10 2017-04-17 UK c
13 2017-04-11 2017-04-10 2017-04-17 UK c
14 2017-04-12 2017-04-10 2017-04-17 UK c
15 2017-04-13 2017-04-10 2017-04-17 UK c
16 2017-04-14 2017-04-10 2017-04-17 UK c
17 2017-04-17 2017-04-10 2017-04-17 UK c
Upvotes: 7