Clover

Reputation: 53

Python pandas calculate time until value in a column is less than it is in current period

I have a pandas DataFrame with several columns and a datetime stamp. I want to create a new column that gives, for each row, the time until output first drops below its value in that row.

My current table looks something like this:

 datetime               output
 2014-05-01 01:00:00    3
 2014-05-01 01:00:01    2
 2014-05-01 01:00:02    3
 2014-05-01 01:00:03    2
 2014-05-01 01:00:04    1

I'm trying to get my table to have an extra column and look like this:

 datetime               output     secondsuntildecrease
 2014-05-01 01:00:00    3         1
 2014-05-01 01:00:01    2         3
 2014-05-01 01:00:02    3         1
 2014-05-01 01:00:03    2         1
 2014-05-01 01:00:04    1         

Thanks in advance!

Upvotes: 3

Views: 1230

Answers (4)

user2285236

Reputation:

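import numpy as np
import pandas as pd

# upper_triangle[i, j] is True when row j is not before row i and output[j] < output[i]
upper_triangle     = np.triu(df.output.values < df.output.values[:, None])
df['datetime']     = pd.to_datetime(df['datetime'])
# argmax gives the position of the first True in each row
df['s_until_dec']  = df['datetime'].iloc[upper_triangle.argmax(axis=1)].values - df['datetime']
df.loc[~upper_triangle.any(axis=1), 's_until_dec'] = np.nan  # no later decrease
df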
             datetime  output           s_until_dec
0 2014-05-01 01:00:00       3              00:00:01
1 2014-05-01 01:00:01       2              00:00:03
2 2014-05-01 01:00:02       3              00:00:01
3 2014-05-01 01:00:03       2              00:00:01
4 2014-05-01 01:00:04       1                   NaT

Here's how it works:

The expression df.output.values < df.output.values[:, None] creates a pairwise comparison matrix via broadcasting ([:, None] adds a new axis):

df.output.values < df.output.values[:, None]
Out: 
array([[False,  True, False,  True,  True],
       [False, False, False, False,  True],
       [False,  True, False,  True,  True],
       [False, False, False, False,  True],
       [False, False, False, False, False]], dtype=bool)

Here, for example, output[1] is smaller than output[0], so the matrix element (0, 1) is True. Only comparisons with later rows matter, so I used np.triu to keep the upper triangle of this matrix. argmax() then gives the position of the first True value in each row, and indexing the datetime column with those positions gives the corresponding date. The exception is the last row: it is all False, so argmax() returns 0 and the result has to be replaced with np.nan. The .loc line checks the matrix for that case and does the replacement.
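
To illustrate why the any() check matters, here is a minimal standalone sketch. The array mask is made up for illustration and is not taken from the answer's data. argmax() returns 0 for a row with no True at all, which cannot be told apart from a hit at position 0 without any().

```python
import numpy as np

# Hypothetical rows of a comparison matrix, made up for illustration
mask = np.array([[False, True, False],    # first True at position 1
                 [False, False, False]])  # no True at all

first_true = mask.argmax(axis=1)  # position of the first True per row
has_true = mask.any(axis=1)       # whether the row contains any True

print(first_true)  # [1 0]
print(has_true)    # [ True False]
```

The second row reports position 0 even though nothing matched, so it has to be masked out separately.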

Upvotes: 3

piRSquared

Reputation: 294258

Use numpy's outer subtract to get a matrix of differences.

Then filter with numpy's triu (upper triangle) function so we only take differences for future times and stay out of the past.

Use where, together with any, to blank out rows that are all False, i.e. rows with no later decrease.

Finally, take the difference in time.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    dict(output=[3, 2, 3, 2, 1],
         datetime=pd.date_range(start='2014-05-01 01:00:00', periods=5, freq='s'))
)

# gt0[i, j] is True when row j is later than row i and output[i] > output[j]
out = df.output.values
gt0 = np.triu(np.subtract.outer(out, out), 1) > 0
# position of the first later, smaller value in each row
idx = gt0.argmax(1)
# time to that row; rows with no later decrease become NaN
(df.datetime.values[idx] - df.datetime).dt.total_seconds().where(gt0.any(1))

0    1.0
1    3.0
2    1.0
3    1.0
4    NaN
Name: datetime, dtype: float64
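
As a rough illustration of the intermediate values on the sample data, the sketch below prints what argmax and any return for the filtered matrix. The variable names diff and future are chosen here for readability and do not appear in the answer.

```python
import numpy as np

out = np.array([3, 2, 3, 2, 1])     # the sample output column

diff = np.subtract.outer(out, out)  # diff[i, j] = out[i] - out[j]
future = np.triu(diff, 1)           # k=1 zeroes the diagonal and the past
gt0 = future > 0                    # True where a later value is smaller

print(gt0.argmax(1))  # [1 4 3 4 0]
print(gt0.any(1))     # [ True  True  True  True False]
```

The last row has no later value at all, so argmax falls back to 0 and any flags it as a row to blank out.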

Timing

Mine and ayhan's seem the most performant on a small sample.

ayhan's is best over 10,000 rows
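
A comparison like this could be rerun with timeit, roughly as sketched below. The helper names make_frame, broadcast_version and apply_version, the random data and the frame size are all assumptions made for illustration. They are not the setup behind the timings reported here.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n, seed=0):
    """Build a hypothetical test frame with one row per second."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'datetime': pd.date_range('2014-05-01 01:00:00', periods=n, freq='s'),
        'output': rng.integers(0, 10, size=n),
    })


def broadcast_version(df):
    """Pairwise comparison matrix; memory grows with n squared."""
    out = df['output'].values
    later_and_smaller = np.triu(out < out[:, None])
    first = later_and_smaller.argmax(axis=1)
    delta = df['datetime'].values[first] - df['datetime']
    # Blank out rows that have no later, smaller value
    return delta.dt.total_seconds().where(later_and_smaller.any(axis=1))


def apply_version(df):
    """Row-wise search for the first later row with a smaller output."""
    def seconds_until(row):
        later = df.loc[(df['output'] < row['output'])
                       & (df['datetime'] > row['datetime']), 'datetime']
        if later.empty:
            return np.nan
        return (later.min() - row['datetime']).total_seconds()
    return df.apply(seconds_until, axis=1)


df = make_frame(200)
for func in (broadcast_version, apply_version):
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f'{func.__name__}: {seconds / 3:.4f} s per call')
```

Actual numbers depend on the machine, the pandas version and the data, so treat any single run as indicative only.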

Upvotes: 0

SO44

Reputation: 1329

Here's a one liner

df['seconds_until'] = df.apply(lambda x: pd.to_datetime(df.loc[(df['output'] < x['output']) & (df['datetime'] > x['datetime']), 'datetime'].min()) - pd.to_datetime(x['datetime']), axis=1)

output

              datetime  output  seconds_until
0  2014/05/01 01:00:00       3       00:00:01
1  2014/05/01 01:00:01       2       00:00:03
2  2014/05/01 01:00:02       3       00:00:01
3  2014/05/01 01:00:03       2       00:00:01
4  2014/05/01 01:00:04       1            NaT

Upvotes: 1

Alex

Reputation: 19104

import numpy as np
import pandas as pd

df = pd.DataFrame([3, 2, 3, 2, 1], index=pd.date_range(start='2014-05-01 01:00:00', periods=5, freq='s'), columns=['output'])

def f(s):
    s = s[s & (s.index > s.name)]
    if s.empty:
        return np.nan
    else:
        return (s.index[0] - s.name).total_seconds()

df['secondsuntildecrease'] = df['output'].apply(lambda x: df['output'] < x).apply(f, axis=1)

df

outputs

                     output  secondsuntildecrease
2014-05-01 01:00:00       3                   1.0
2014-05-01 01:00:01       2                   3.0
2014-05-01 01:00:02       3                   1.0
2014-05-01 01:00:03       2                   1.0
2014-05-01 01:00:04       1                   NaN

Upvotes: 2
