kroonike

Reputation: 1129

Pandas DataFrame, calculating days left until the next specified date?

I currently have a pandas DataFrame that looks similar to this:

              Data
2018-07-02    15.60
2018-07-03    16.14
2018-07-09    12.69
2018-07-17    12.06
2018-07-19    12.87
2018-07-27    13.03
2018-08-01    13.15
2018-08-02    12.19
2018-08-07    10.93
2018-08-09    11.27
2018-08-10    13.16
2018-08-16    13.45
2018-08-17    12.64
2018-08-20    12.49
2018-08-21    12.86
2018-08-22    12.25

I also have a list of dates:

dates = [datetime.datetime(2018, 8, 20, 0, 0),
         datetime.datetime(2018, 8, 1, 0, 0),
         datetime.datetime(2018, 7, 2, 0, 0)]

I want to add another column to my DataFrame called 'Count'. The count starts from the latest date in the dates list and resets whenever the DatetimeIndex of my DataFrame matches the next date in the list. The end result should look like this:

              Data     Count     
2018-07-02    15.60    1
2018-07-03    16.14    6 
2018-07-09    12.69    5
2018-07-17    12.06    4
2018-07-19    12.87    3
2018-07-27    13.03    2
2018-08-01    13.15    1
2018-08-02    12.19    7
2018-08-07    10.93    6
2018-08-09    11.27    5
2018-08-10    13.16    4
2018-08-16    13.45    3
2018-08-17    12.64    2 
2018-08-20    12.49    1

What is the most efficient way to do it on a large DataFrame?

Upvotes: 1

Views: 544

Answers (2)

Raviteja Ainampudi

Reputation: 282

Write a function that finds the smallest time difference between a DatetimeIndex value and the entries in the dates list, then map it over the DataFrame's index to obtain the new column.

You can adapt the code below to your requirements.

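import datetime

dates = [datetime.datetime(2018, 8, 20, 0, 0),
         datetime.datetime(2018, 8, 1, 0, 0),
         datetime.datetime(2018, 7, 2, 0, 0)]

def time_diff(ts):
    # Keep only the gaps to listed dates on or after ts (non-negative differences).
    diff_values = [d - ts for d in dates if d >= ts]
    # Smallest gap to an upcoming date; None when no listed date follows ts.
    return min(diff_values) if diff_values else None

dataframe["Count"] = dataframe.index.map(time_diff)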
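
For a large DataFrame, the same idea (time to the nearest upcoming listed date) can probably be computed without a Python-level function call per row. The following is only a sketch: it assumes the index is a DatetimeIndex, uses a shortened sample of the question's data, and the names targets, pos, has_next and the column DaysLeft are illustrative, not from the question. Note that it yields calendar days, which is not the same as the row count shown in the question's expected output.

```python
import datetime

import numpy as np
import pandas as pd

dates = [datetime.datetime(2018, 8, 20),
         datetime.datetime(2018, 8, 1),
         datetime.datetime(2018, 7, 2)]

# Shortened sample taken from the question's index (assumption: DatetimeIndex).
idx = pd.DatetimeIndex(["2018-07-02", "2018-07-03", "2018-07-27",
                        "2018-08-01", "2018-08-17", "2018-08-22"])
dataframe = pd.DataFrame({"Data": [15.60, 16.14, 13.03, 13.15, 12.64, 12.25]},
                         index=idx)

# searchsorted needs the target dates in ascending order.
targets = pd.DatetimeIndex(sorted(dates))

# For each row, position of the first target date that is >= the row's date.
pos = targets.searchsorted(dataframe.index, side="left")

# Rows after the last target date have no upcoming date.
has_next = pos < len(targets)

# Calendar days to the next target; NaN where there is none.
days_left = np.full(len(dataframe), np.nan)
gaps = targets[pos[has_next]] - dataframe.index[has_next]
days_left[has_next] = gaps.days
dataframe["DaysLeft"] = days_left

print(dataframe)
```

Rows that fall exactly on a listed date get 0 here, and rows after the last listed date stay NaN.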

Upvotes: 0

jezrael

Reputation: 862591

Use GroupBy.cumcount with a helper Series built from isin and cumsum; reverse the Series with [::-1] so that the groups are formed from the last row upward:

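import pandas as pd
# Flag rows whose index is in dates, then cumulate from the bottom to label each group.
s = pd.Series(df.index.isin(dates), index=df.index)[::-1].cumsum()
# Count the rows remaining in each group, so every listed date gets 1.
df['Count'] = df.groupby(s).cumcount(ascending=False) + 1
print(df)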
             Data  Count
2018-07-02  15.60      1
2018-07-03  16.14      6
2018-07-09  12.69      5
2018-07-17  12.06      4
2018-07-19  12.87      3
2018-07-27  13.03      2
2018-08-01  13.15      1
2018-08-02  12.19      7
2018-08-07  10.93      6
2018-08-09  11.27      5
2018-08-10  13.16      4
2018-08-16  13.45      3
2018-08-17  12.64      2
2018-08-20  12.49      1
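
To try this end to end, here is a self-contained sketch that rebuilds the question's 16-row frame and applies the same two steps. The names is_reset and CountMasked are illustrative additions, not part of the question. The two rows after the last listed date (2018-08-21 and 2018-08-22) fall into their own group and are counted as 2 and 1; the optional last step blanks them out, on the assumption that rows with no upcoming date should carry no count.

```python
import datetime

import pandas as pd

dates = [datetime.datetime(2018, 8, 20),
         datetime.datetime(2018, 8, 1),
         datetime.datetime(2018, 7, 2)]

index = pd.to_datetime([
    "2018-07-02", "2018-07-03", "2018-07-09", "2018-07-17",
    "2018-07-19", "2018-07-27", "2018-08-01", "2018-08-02",
    "2018-08-07", "2018-08-09", "2018-08-10", "2018-08-16",
    "2018-08-17", "2018-08-20", "2018-08-21", "2018-08-22",
])
df = pd.DataFrame(
    {"Data": [15.60, 16.14, 12.69, 12.06, 12.87, 13.03, 13.15, 12.19,
              10.93, 11.27, 13.16, 13.45, 12.64, 12.49, 12.86, 12.25]},
    index=index,
)

# True on rows whose index value appears in dates.
is_reset = pd.Series(df.index.isin(dates), index=df.index)

# Cumulative sum from the bottom: each listed date opens a new group
# that extends upward until the previous listed date.
s = is_reset[::-1].cumsum()

# Within each group, count down so the listed date itself gets 1.
df["Count"] = df.groupby(s).cumcount(ascending=False) + 1

# Optional (assumption): drop counts for rows after the last listed date.
df["CountMasked"] = df["Count"].where(df.index <= max(dates))

print(df)
```

Everything here is vectorized, so it should scale to a large DataFrame considerably better than a per-row Python function.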

Upvotes: 2
