Lucas

Reputation: 93

How to create a timeseries from a dataframe of event durations?

I have a dataframe of bookings for one room (columns: booking_id, check-in date and check-out date) that I want to transform into a timeseries indexed by every day of the year (index: day of year, feature: booked or not).

I have calculated the duration of the bookings, and reindexed the dataframe daily. Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.

I tried iterating through each row with ffill, but it applies to the entire dataframe rather than to selected rows. Any idea how I can do that?

Here is my code:

import numpy as np
import pandas as pd
#create dataframe
data=[[1, '2019-01-01', '2019-01-02', 1],
      [2, '2019-01-03', '2019-01-07', 4], 
      [3, '2019-01-10','2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])

#cast dates to datetime formats
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])

#create timeseries indexed on check-in date
df2 = df.set_index('check-in')

#create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)

I have this:

    booking_id  check-out   duration
2019-01-01  1.0     2019-01-02  1.0
2019-01-02  NaN     NaT     NaN
2019-01-03  2.0     2019-01-07  4.0
2019-01-04  NaN     NaT     NaN
2019-01-05  NaN     NaT     NaN
2019-01-06  NaN     NaT     NaN
2019-01-07  NaN     NaT     NaN
2019-01-08  NaN     NaT     NaN
2019-01-09  NaN     NaT     NaN
2019-01-10  3.0     2019-01-13  3.0
2019-01-11  NaN     NaT     NaN
2019-01-12  NaN     NaT     NaN
2019-01-13  NaN     NaT     NaN

I expect to have:

    booking_id  check-out   duration
2019-01-01  1.0     2019-01-02  1.0
2019-01-02  1.0     2019-01-02  1.0
2019-01-03  2.0     2019-01-07  4.0
2019-01-04  2.0     2019-01-07  4.0
2019-01-05  2.0     2019-01-07  4.0
2019-01-06  2.0     2019-01-07  4.0
2019-01-07  NaN     NaT     NaN
2019-01-08  NaN     NaT     NaN
2019-01-09  NaN     NaT     NaN
2019-01-10  3.0     2019-01-13  3.0
2019-01-11  3.0     2019-01-13  3.0
2019-01-12  3.0     2019-01-13  3.0
2019-01-13  NaN     NaT     NaN

Upvotes: 2

Views: 73

Answers (3)

G.G

Reputation: 765

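def function1(dd: pd.DataFrame):
    #duration is the third column; the first row of each group is the check-in row
    num2 = int(dd.iat[0, 2])
    #forward-fill only the first num2 rows of the group, then merge them back in
    return dd.combine_first(dd.iloc[:num2, :].ffill())

#ts is assumed to be the daily-reindexed frame from the question; a new group starts at each check-in row
ts.groupby(ts.duration.gt(0).cumsum(), as_index=False, group_keys=False).apply(function1)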


      booking_id   check-out  duration
2019-01-01         1.0  2019-01-02       1.0
2019-01-02         NaN         NaN       NaN
2019-01-03         2.0  2019-01-07       4.0
2019-01-04         2.0  2019-01-07       4.0
2019-01-05         2.0  2019-01-07       4.0
2019-01-06         2.0  2019-01-07       4.0
2019-01-07         NaN         NaN       NaN
2019-01-08         NaN         NaN       NaN
2019-01-09         NaN         NaN       NaN
2019-01-10         3.0  2019-01-13       3.0
2019-01-11         3.0  2019-01-13       3.0
2019-01-12         3.0  2019-01-13       3.0
2019-01-13         NaN         NaN       NaN

Upvotes: 0

Ben Pap

Reputation: 2579

filluntil = ts['check-out'].ffill()
m = ts.index < filluntil.values

#reshaping the mask to be the same shape as ts
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)

ts = ts.ffill().where(m)

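First we create a series in which the check-out dates are forward-filled. Then we create a mask that is True wherever the index date is earlier than the filled check-out date. Finally we forward-fill the whole frame and keep only the rows where the mask is True.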

If you want to include the row with the check-out date, change the comparison in m from < to <=.
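For reference, here is the mask approach above run end to end on the sample data from the question. It is a minimal sketch: the names booked_mask, mask2d, filled and booked are made up for illustration. Note that with <, 2019-01-02 stays empty because the first booking checks out that day, which differs from the expected table in the question for that one row.

```python
import numpy as np
import pandas as pd

# Sample bookings copied from the question.
df = pd.DataFrame(
    [[1, '2019-01-01', '2019-01-02', 1],
     [2, '2019-01-03', '2019-01-07', 4],
     [3, '2019-01-10', '2019-01-13', 3]],
    columns=['booking_id', 'check-in', 'check-out', 'duration'],
)
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])

# Daily reindex, as in the question.
idx = pd.date_range(df['check-in'].min(), df['check-out'].max(), freq='D')
ts = df.set_index('check-in').reindex(idx)

# A day counts as booked while it is strictly before the most recent check-out.
filluntil = ts['check-out'].ffill()
booked_mask = ts.index < filluntil.values

# DataFrame.where needs a condition with the same shape as the frame,
# so repeat the per-row mask once per column.
mask2d = np.repeat(booked_mask, ts.shape[1]).reshape(ts.shape)

# Forward-fill everything, then blank out the rows that are not booked.
filled = ts.ffill().where(mask2d)

# The "booked or not" feature the question asks for.
booked = pd.Series(booked_mask, index=ts.index, name='booked')
print(filled)
print(booked)
```

If only the booked/not-booked flag is needed, booked alone is enough and the where step can be skipped.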

Upvotes: 1

sometimesiwritecode

Reputation: 3213

I think to "forward-fill the dataframe" you should use the pandas interpolate method. The documentation can be found here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html

You can do something like this:

int_how_many_consecutive_to_fill = 3
df2 = df2.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')

Look at the documentation for interpolate; the method has many options you can set through its parameters.
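To make the effect of limit concrete, here is a minimal sketch on a toy series (the series and variable names are made up for illustration). With no later value to interpolate toward, linear interpolation just repeats the last known value, so in this case it should behave like ffill with the same limit.

```python
import numpy as np
import pandas as pd

# Toy series: one known value followed by four missing ones.
s = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan])

# limit caps how many consecutive NaNs are filled after a valid value.
capped_interp = s.interpolate(limit=2, limit_direction='forward')
capped_ffill = s.ffill(limit=2)

print(capped_interp.tolist())
print(capped_ffill.tolist())
```

In both results the first two gaps are filled and the remaining two stay NaN.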

EDIT:

To do this using the value in the duration column as the limit for each interpolation, the following is a bit messy, but I think it should work (there may be a cleaner solution using pandas or another library that I am unaware of):

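#df2 is assumed here to be the daily-reindexed frame (ts in the question)
#get rows where every column is NaN:
nans_df = df2[df2.isnull().all(axis=1)]
#get rows with at least one non-NaN value:
non_nans_df = df2[~df2.isnull().all(axis=1)]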

#list of dfs we will concat vertically at the end to get final dataframe.
dfs = []

#iterate through each row that contains NaNs.
for nan_index, nan_row in nans_df.iterrows():
    previous_day = nan_index - pd.DateOffset(1)
    #this checks if the previous day to this NaN row is a day where we have non nan values, if the previous day is a nan day just skip this loop. This is mostly here to handle the case where the first row is a NaN one.
    if previous_day not in non_nans_df.index:
        continue

    date_offset = 0
    #here we are checking how many sequential rows there are after this one with all nan values in it, this will be stored in the date_offset variable.
    while (nan_index + pd.DateOffset(date_offset)) in nans_df.index:
        date_offset += 1

    #this gets us the first date after the run of consecutive all-NaN days that starts at the current row.
    end_sequence_date = nan_index + pd.DateOffset(date_offset)

    #this gives us a dataframe where the first row in it is the previous day to this one(nan_index), confirmed to be non NaN by the first if statement in this for loop. It then combines this non NaN row with all the sequential nan rows after it into the variable df_to_interpolate. 
    df_to_interpolate = pd.concat([non_nans_df.loc[[previous_day]], nans_df.loc[nan_index:end_sequence_date]])

    # now we pull the duration value for the first row in our  df_to_interpolate dataframe. 
    limit_val = int(df_to_interpolate['duration'].iloc[0])

    #here we interpolate the dataframe using the limit_val
    df_to_interpolate = df_to_interpolate.interpolate(axis=0, limit=limit_val, limit_direction='forward')

    #append df_to_interpolate to our list that gets combined at the end.
    dfs.append(df_to_interpolate)

#gives us our final dataframe, interpolated forward using a dynamic limit value based on the most recent duration value.
final_df = pd.concat(dfs)

Upvotes: 1
