Reputation: 357

Creating new rows in df based on date range

Below is script for a simplified version of the df in question:

import pandas as pd
import numpy as np
from datetime import date
from datetime import datetime

df = pd.DataFrame({'date':pd.date_range(start='2020-01-01', end='2020/01/07'), 
                   'id' : range(1,8), 
                   'product': ['list_3','list_1','list_2', 'list_3','list_2','list_1','list_1'],
                  'duration' : [3,1,2,3,2,1,1],
                  'product_in_use': ('true'),
                  'end_date':['2020-01-03','2020-01-02','2020-01-04','2020-01-06','2020-01-06','2020-01-06',
                                '2020-01-07']})

df['duration']= pd.to_timedelta(df['duration'], unit='D')
df['date'] = pd.to_datetime(df['date'])
df['end_date'] = pd.to_datetime(df['end_date'])
df

df:

    date       id   product duration product_in_use end_date
0   2020-01-01  1   list_3  3 days     true        2020-01-03
1   2020-01-02  2   list_1  1 days     true        2020-01-02
2   2020-01-03  3   list_2  2 days     true        2020-01-04
3   2020-01-04  4   list_3  3 days     true        2020-01-06
4   2020-01-05  5   list_2  2 days     true        2020-01-06
5   2020-01-06  6   list_1  1 days     true        2020-01-06
6   2020-01-07  7   list_1  1 days     true        2020-01-07

As you can see in the df above, each id is using a product and each product lasts for a particular duration. There are no rows for the days when the product is in action, only rows for when the product was purchased by the user.

Therefore, I would like to create new rows, for all dates where the product is in use for each id.

My intended df would therefore be this:

    date       id   product  duration   product_in_use
0   2020-01-01  1   list_3   3 days         true
1   2020-01-02  1   list_3   3 days         true
2   2020-01-03  1   list_3   3 days         true
3   2020-01-02  2   list_1   1 days         true
4   2020-01-03  3   list_2   2 days         true
5   2020-01-04  3   list_2   2 days         true
6   2020-01-04  4   list_3   3 days         true
7   2020-01-05  4   list_3   3 days         true
8   2020-01-06  4   list_3   3 days         true
9   2020-01-05  5   list_2   3 days         true
10  2020-01-06  5   list_2   2 days         true
11  2020-01-06  6   list_1   2 days         true
12  2020-01-07  7   list_1   1 days         true

Upvotes: 4

Answers (4)

Amjad Al Taleb

Reputation: 151

Create a different DataFrame, then execute an outer join to add the new rows.

Upvotes: 1

sammywemmy

Reputation: 28644

create date ranges from the start date to the end date for each id, using starmap and chain, expand the main dataframe by the duration, and assign the new dates as index of the dataframe.

from itertools import starmap,chain

#create date ranges from date to end_date for each id
start_end = zip(df.date.array,df.end_date.array)
date_ranges = starmap(pd.date_range,start_end)
date_ranges = chain.from_iterable(date_ranges)

#get all columns except date and end_date
res = df.filter(['id','product','duration','product_in_use'])

#expand the dataframe by repeating the indexes based on the duration
#so index 0 will be repeated 3 times, 1 once, 2 twice, ...
res = res.reindex(res.index.repeat(res.duration.dt.days))

#assign the new date_ranges to the dataframe
res.index = date_ranges
res

           id   product duration    product_in_use
2020-01-01  1   list_3    3 days    true
2020-01-02  1   list_3    3 days    true
2020-01-03  1   list_3    3 days    true
2020-01-02  2   list_1    1 days    true
2020-01-03  3   list_2    2 days    true
2020-01-04  3   list_2    2 days    true
2020-01-04  4   list_3    3 days    true
2020-01-05  4   list_3    3 days    true
2020-01-06  4   list_3    3 days    true
2020-01-05  5   list_2    2 days    true
2020-01-06  5   list_2    2 days    true
2020-01-06  6   list_1    1 days    true
2020-01-07  7   list_1    1 days    true

Upvotes: 2

Anshul

Reputation: 1413

If you do not convert 'duration' field to timedelta then this worked for me:

df1 = pd.DataFrame()

for idx in df.index:
    print(idx, df['duration'][idx])
    for i in range(df['duration'][idx]):
        temp_df = df[idx:idx+1]
        temp_df['date'] = pd.to_datetime(temp_df['date']) + timedelta(days=i)
        df1 = df1.append(temp_df)

df1.reset_index(inplace=True)
df1.drop(['end_date', 'index'], axis=1, inplace=True)

print(df1)

Output:

         date  id product  duration product_in_use
0  2020-01-01   1  list_3         3           true
1  2020-01-02   1  list_3         3           true
2  2020-01-03   1  list_3         3           true
3  2020-01-02   2  list_1         1           true
4  2020-01-03   3  list_2         2           true
5  2020-01-04   3  list_2         2           true
6  2020-01-04   4  list_3         3           true
7  2020-01-05   4  list_3         3           true
8  2020-01-06   4  list_3         3           true
9  2020-01-05   5  list_2         2           true
10 2020-01-06   5  list_2         2           true
11 2020-01-06   6  list_1         1           true
12 2020-01-07   7  list_1         1           true

Upvotes: 1

BALAJI R

Reputation: 129

s = pd.concat(pd.Series(r.Index, pd.date_range(r.date,r.end_date))
              for r in df.itertuples())

df1 = df.loc[s].assign(date=s.index).reset_index(drop=True)

print(df1)

Upvotes: 1

Creating new rows in df based on date range

Answers (4)

Related Questions