Eric B

Reputation: 1675

expanding a dataframe based on start and end columns (speed)

I have a pandas.DataFrame containing start and end columns, plus a couple of additional columns. I would like to expand this dataframe into a time series that starts at each row's start value and ends at its end value, while copying the other columns. So far I came up with the following:

import pandas as pd
import datetime as dt

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

data_series = list()
for row in df.itertuples():
    time_range = pd.bdate_range(row.start, row.end)
    s = len(time_range)
    data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))

columns_names = ['date', 'start', 'end', 'country', 'letter']
df = pd.DataFrame(data_series, columns=columns_names)

Starting Dataframe:

       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-05 2017-04-12      EU      b
2 2017-04-10 2017-04-17      UK      c

Desired output:

         date      start        end country letter
0  2017-04-03 2017-04-03 2017-04-10      US      a
1  2017-04-04 2017-04-03 2017-04-10      US      a
2  2017-04-05 2017-04-03 2017-04-10      US      a
3  2017-04-06 2017-04-03 2017-04-10      US      a
4  2017-04-07 2017-04-03 2017-04-10      US      a
5  2017-04-10 2017-04-03 2017-04-10      US      a
6  2017-04-05 2017-04-05 2017-04-12      EU      b
7  2017-04-06 2017-04-05 2017-04-12      EU      b
8  2017-04-07 2017-04-05 2017-04-12      EU      b
9  2017-04-10 2017-04-05 2017-04-12      EU      b
10 2017-04-11 2017-04-05 2017-04-12      EU      b
11 2017-04-12 2017-04-05 2017-04-12      EU      b
12 2017-04-10 2017-04-10 2017-04-17      UK      c
13 2017-04-11 2017-04-10 2017-04-17      UK      c
14 2017-04-12 2017-04-10 2017-04-17      UK      c
15 2017-04-13 2017-04-10 2017-04-17      UK      c
16 2017-04-14 2017-04-10 2017-04-17      UK      c
17 2017-04-17 2017-04-10 2017-04-17      UK      c

The problem with my solution is that, when applied to a much bigger dataframe (mostly in terms of rows), it is not fast enough. Does anybody have ideas for how I could improve it? I am also considering solutions in numpy.
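As a side note, pandas 0.25 and later offer `DataFrame.explode`, which gives a compact alternative to the loop (a sketch, not part of the original question; timings not measured here):

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    'start': [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    'end': [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    'country': ['US', 'EU', 'UK'],
    'letter': ['a', 'b', 'c'],
})

# Attach one business-day range per row, then explode to one row per day.
out = (df.assign(date=[pd.bdate_range(s, e) for s, e in zip(df['start'], df['end'])])
         .explode('date')
         .reset_index(drop=True))
out = out[['date', 'start', 'end', 'country', 'letter']]
```

The list comprehension is still per-row, but the expansion itself is handled by pandas.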

Upvotes: 9

Views: 2714

Answers (5)

scottmsul

Reputation: 119

Since the goal here is speed, we should vectorize every step. A single for loop can slow the code down by multiple orders of magnitude. A vectorized solution is provided below:

import numpy as np
import pandas as pd

cols = list(df.columns)

df['data_id'] = np.arange(0, len(df))

data_id = df['data_id']
start = df['start']
end = df['end']

diff = ((end-start) / np.timedelta64(1, 'D')).astype('int') + 1
repeated_id = np.repeat(data_id, diff)
time_df = pd.DataFrame(data={'data_id': repeated_id})
time_df = pd.merge(left=time_df, right=df[['data_id', 'start']], on=['data_id'])
time_df['day_id'] = np.arange(0, len(time_df))

min_day_id = time_df.groupby('data_id')['day_id'].min().reset_index().rename(columns={'day_id': 'min_day_id'})
time_df = pd.merge(left=time_df, right=min_day_id, on=['data_id'])
days_to_add = (time_df['day_id'] - time_df['min_day_id']) * np.timedelta64(1, 'D')
time_df['date'] = time_df['start'] + days_to_add

time_df = time_df[time_df['date'].dt.dayofweek < 5]

df = pd.merge(left=df, right=time_df[['data_id', 'date']], on=['data_id'])
df = df[['date']+cols]

The vectorized version works as follows:

  • Give each row of the original dataframe an 'id' column (the dataframe index could also work)
  • Create a new dataframe which expands each row to the number of days in the range
  • Enumerate each group of expanded rows with integers 0..N
  • Add this number of days to the start date to get the actual date
  • Filter business days
  • Merge the expanded dataframe with dates back to the original dataframe on the 'id' column
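The enumerate-and-min steps can also be collapsed with `groupby().cumcount()`, which numbers each expanded group 0..N-1 directly and avoids one merge (my sketch of the same idea, not scottmsul's exact code):

```python
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'start': [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    'end': [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    'country': ['US', 'EU', 'UK'],
    'letter': ['a', 'b', 'c'],
})

# Number of calendar days per row (inclusive).
diff = ((df['end'] - df['start']) / np.timedelta64(1, 'D')).astype(int) + 1

# Repeat each row id, then number the repeats 0..N-1 per id with cumcount.
time_df = pd.DataFrame({'data_id': np.repeat(df.index.values, diff)})
time_df['offset'] = time_df.groupby('data_id').cumcount()
time_df = time_df.merge(df[['start']], left_on='data_id', right_index=True)
time_df['date'] = time_df['start'] + time_df['offset'] * np.timedelta64(1, 'D')

# Keep business days only, then join the remaining columns back.
time_df = time_df[time_df['date'].dt.dayofweek < 5]
result = time_df[['data_id', 'date']].merge(df, left_on='data_id', right_index=True)
```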

Using jezrael's timing setup, the original method took 1.15 s on my machine, while the vectorized version took 56.9 ms, a 20-fold speedup.

Upvotes: 0

jezrael

Reputation: 863341

Timings, plus three more solutions:

#original solution 
In [163]: %%timeit
     ...: data_series = list()
     ...: for row in df.itertuples():
     ...:     time_range = pd.bdate_range(row.start, row.end)
     ...:     s = len(time_range)
     ...:     data_series += (zip(time_range, [row.start]*s, [row.end]*s, [row.country]*s, [row.letter]*s))
     ...: 
     ...: columns_names = ['date', 'start', 'end', 'country', 'letter']
     ...: df3 = pd.DataFrame(data_series, columns=columns_names)
     ...: 
1 loop, best of 3: 634 ms per loop

#Stephen Rauch's solution, slightly changed to avoid warnings
In [164]: %%timeit
     ...: dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
     ...:          for row in df[['start', 'end']].iterrows()]
     ...: deltas = [len(x) for x in dates]
     ...: dates = pd.Series(pd.concat(dates).values, name='date')
     ...: df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
     ...: df2['start'] = pd.to_datetime(df2['start'])
     ...: df2['end'] = pd.to_datetime(df2['end'])
     ...: df2 = pd.concat([dates, df2], axis=1)
     ...: 
1 loop, best of 3: 669 ms per loop

#maxu solution
In [165]: %%timeit
     ...: dates = [pd.bdate_range(r[0],r[1]).to_series() for r in df[['start','end']].values]
     ...: lens = [len(x) for x in dates]
     ...: r = pd.DataFrame(
     ...:         {col:np.repeat(df[col].values, lens) for col in df.columns}
     ...:     ).assign(date=np.concatenate(dates))
     ...: 
1 loop, best of 3: 609 ms per loop

#jezrael solution1
In [166]: %%timeit
     ...: df1 = pd.concat([pd.Series(r.Index, 
     ...:                            pd.bdate_range(r.start, r.end)) 
     ...:                            for r in df.itertuples()]).reset_index()
     ...: df1.columns = ['date','idx']
     ...: df2 = df1.set_index('idx').join(df).reset_index(drop=True)
     ...: 
1 loop, best of 3: 632 ms per loop

#jezrael solution2 (improved maxu solution)
In [167]: %%timeit
     ...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
     ...: lens = [len(x) for x in dates]
     ...: 
     ...: df4 = pd.DataFrame(
     ...:         {col:np.repeat(df[col].values, lens) for col in df.columns}
     ...:     )
     ...: df4.insert(0, 'date', np.concatenate(dates))
     ...: 
1 loop, best of 3: 583 ms per loop

#jezrael solution 3
In [207]: %%timeit
     ...: dates = [pd.bdate_range(r[0],r[1]) for r in df[['start','end']].values]
     ...: lens = [len(x) for x in dates]
     ...: r = np.repeat(df.index.values, lens)
     ...: df2 = pd.DataFrame(df.values[r], df.index[r], df.columns).reset_index(drop=True)
     ...: df2['start'] = pd.to_datetime(df2['start'])
     ...: df2['end'] = pd.to_datetime(df2['end'])
     ...: df2.insert(0, 'date', np.concatenate(dates))
     ...: 
1 loop, best of 3: 609 ms per loop

Code for timings:

import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame()
N = 100
#N = 1
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]*N
df['end'] = [dt.datetime(2017, 8, 10), dt.datetime(2017, 5, 12), dt.datetime(2017, 5, 17)]*N
df['country'] = ['US', 'EU', 'UK']*N
df['letter'] = ['a', 'b', 'c']*N

Upvotes: 6

Chuck

Reputation: 3852

For your dataframe:

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-05 2017-04-12      EU      b
2 2017-04-10 2017-04-17      UK      c

First, build a new index using pd.date_range or DatetimeIndex(), bounded by the first start and the last end value:

index = pd.date_range(df['start'].iloc[0], df['end'].iloc[-1])
# or
index = pd.DatetimeIndex(start=df['start'].iloc[0], end=df['end'].iloc[-1], freq='D')

>>> index
DatetimeIndex(['2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06',
               '2017-04-07', '2017-04-08', '2017-04-09', '2017-04-10',
               '2017-04-11', '2017-04-12', '2017-04-13', '2017-04-14',
               '2017-04-15', '2017-04-16', '2017-04-17'],
              dtype='datetime64[ns]', freq='D')

Then reindex() with method='ffill', reset the index, and rename:

df2 = df.set_index(['start']).reindex(index, method='ffill')
df2['Date'] = df2.index
df2.reset_index().rename(columns={'index':'start'})

        start        end country letter       Date
0  2017-04-03 2017-04-10      US      a 2017-04-03
1  2017-04-04 2017-04-10      US      a 2017-04-04
2  2017-04-05 2017-04-12      EU      b 2017-04-05
3  2017-04-06 2017-04-12      EU      b 2017-04-06
4  2017-04-07 2017-04-12      EU      b 2017-04-07
5  2017-04-08 2017-04-12      EU      b 2017-04-08
6  2017-04-09 2017-04-12      EU      b 2017-04-09
7  2017-04-10 2017-04-17      UK      c 2017-04-10
8  2017-04-11 2017-04-17      UK      c 2017-04-11
9  2017-04-12 2017-04-17      UK      c 2017-04-12
10 2017-04-13 2017-04-17      UK      c 2017-04-13
11 2017-04-14 2017-04-17      UK      c 2017-04-14
12 2017-04-15 2017-04-17      UK      c 2017-04-15
13 2017-04-16 2017-04-17      UK      c 2017-04-16
14 2017-04-17 2017-04-17      UK      c 2017-04-17

# Time:
0.009 s

Old answer:

Set the start column as the index and expand using asfreq('D'), forward-filling the gaps:

df2 = df.set_index(['start']).asfreq('D').fillna(method='ffill').reset_index()
>>>
       start        end country letter
0 2017-04-03 2017-04-10      US      a
1 2017-04-04 2017-04-10      US      a
2 2017-04-05 2017-04-12      EU      b
3 2017-04-06 2017-04-12      EU      b
4 2017-04-07 2017-04-12      EU      b
5 2017-04-08 2017-04-12      EU      b
6 2017-04-09 2017-04-12      EU      b
7 2017-04-10 2017-04-17      UK      c

asfreq documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.asfreq.html

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210942

Inspired by @StephenRauch's solution I'd like to post mine (which is pretty similar):

dates = [pd.bdate_range(r[0],r[1]).to_series() for r in df[['start','end']].values]
lens = [len(x) for x in dates]

r = pd.DataFrame(
        {col:np.repeat(df[col].values, lens) for col in df.columns}
    ).assign(date=np.concatenate(dates))

Result:

In [259]: r
Out[259]:
   country        end letter      start       date
0       US 2017-04-10      a 2017-04-03 2017-04-03
1       US 2017-04-10      a 2017-04-03 2017-04-04
2       US 2017-04-10      a 2017-04-03 2017-04-05
3       US 2017-04-10      a 2017-04-03 2017-04-06
4       US 2017-04-10      a 2017-04-03 2017-04-07
5       US 2017-04-10      a 2017-04-03 2017-04-10
6       EU 2017-04-12      b 2017-04-05 2017-04-05
7       EU 2017-04-12      b 2017-04-05 2017-04-06
8       EU 2017-04-12      b 2017-04-05 2017-04-07
9       EU 2017-04-12      b 2017-04-05 2017-04-10
10      EU 2017-04-12      b 2017-04-05 2017-04-11
11      EU 2017-04-12      b 2017-04-05 2017-04-12
12      UK 2017-04-17      c 2017-04-10 2017-04-10
13      UK 2017-04-17      c 2017-04-10 2017-04-11
14      UK 2017-04-17      c 2017-04-10 2017-04-12
15      UK 2017-04-17      c 2017-04-10 2017-04-13
16      UK 2017-04-17      c 2017-04-10 2017-04-14
17      UK 2017-04-17      c 2017-04-10 2017-04-17

Upvotes: 8

Stephen Rauch

Reputation: 49812

First we can build the dates you need, while keeping track of the number of days in each row via the list deltas:

dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')

then use np.repeat to build up the data matrix with the proper segment lengths:

df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64", "end": "datetime64"})

then insert the dates into the front of the dataframe:

df2 = pd.concat([dates, df2], axis=1)
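As a side note (my sketch, not part of this answer): the astype step is needed because np.repeat on df.values upcasts everything to object. Repeating index labels instead preserves each column's dtype, so no conversion back is required:

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    'start': [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5), dt.datetime(2017, 4, 10)],
    'end': [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12), dt.datetime(2017, 4, 17)],
    'country': ['US', 'EU', 'UK'],
    'letter': ['a', 'b', 'c'],
})

dates = [pd.Series(pd.bdate_range(row.start, row.end)) for row in df.itertuples()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')

# Index.repeat takes per-label repeat counts; .loc then duplicates the
# rows in order while keeping the original datetime64 dtypes.
df2 = df.loc[df.index.repeat(deltas)].reset_index(drop=True)
df2 = pd.concat([dates, df2], axis=1)
```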

Test Code:

import pandas as pd
import numpy as np
import datetime as dt

df = pd.DataFrame()
df['start'] = [dt.datetime(2017, 4, 3), dt.datetime(2017, 4, 5),
               dt.datetime(2017, 4, 10)]
df['end'] = [dt.datetime(2017, 4, 10), dt.datetime(2017, 4, 12),
             dt.datetime(2017, 4, 17)]
df['country'] = ['US', 'EU', 'UK']
df['letter'] = ['a', 'b', 'c']

dates = [pd.Series(pd.bdate_range(row[1].start, row[1].end))
         for row in df[['start', 'end']].iterrows()]
deltas = [len(x) for x in dates]
dates = pd.Series(pd.concat(dates).values, name='date')

df2 = pd.DataFrame(np.repeat(df.values, deltas, axis=0), columns=df.columns)
df2 = df2.astype(dtype={"start": "datetime64", "end": "datetime64"})
df2 = pd.concat([dates, df2], axis=1)
print(df2)

Results:

         date      start        end country letter
0  2017-04-03 2017-04-03 2017-04-10      US      a
1  2017-04-04 2017-04-03 2017-04-10      US      a
2  2017-04-05 2017-04-03 2017-04-10      US      a
3  2017-04-06 2017-04-03 2017-04-10      US      a
4  2017-04-07 2017-04-03 2017-04-10      US      a
5  2017-04-10 2017-04-03 2017-04-10      US      a
6  2017-04-05 2017-04-05 2017-04-12      EU      b
7  2017-04-06 2017-04-05 2017-04-12      EU      b
8  2017-04-07 2017-04-05 2017-04-12      EU      b
9  2017-04-10 2017-04-05 2017-04-12      EU      b
10 2017-04-11 2017-04-05 2017-04-12      EU      b
11 2017-04-12 2017-04-05 2017-04-12      EU      b
12 2017-04-10 2017-04-10 2017-04-17      UK      c
13 2017-04-11 2017-04-10 2017-04-17      UK      c
14 2017-04-12 2017-04-10 2017-04-17      UK      c
15 2017-04-13 2017-04-10 2017-04-17      UK      c
16 2017-04-14 2017-04-10 2017-04-17      UK      c
17 2017-04-17 2017-04-10 2017-04-17      UK      c

Upvotes: 7
