Reputation: 53
I have a pandas DataFrame in Python with several columns and a datetime stamp. I want to create a new column that calculates the time until output is next lower than its value in the current period.
My current table looks something like this:
datetime output
2014-05-01 01:00:00 3
2014-05-01 01:00:01 2
2014-05-01 01:00:02 3
2014-05-01 01:00:03 2
2014-05-01 01:00:04 1
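For anyone who wants to reproduce this, here is a minimal sketch that builds the sample table. The variable name df is only an assumption for illustration; the values are the ones shown above.

```python
import pandas as pd

# Sample data copied from the table above; `df` is just an illustrative name.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2014-05-01 01:00:00",
        "2014-05-01 01:00:01",
        "2014-05-01 01:00:02",
        "2014-05-01 01:00:03",
        "2014-05-01 01:00:04",
    ]),
    "output": [3, 2, 3, 2, 1],
})
print(df)
```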
I'm trying to get my table to have an extra column and look like this:
datetime output secondsuntildecrease
2014-05-01 01:00:00 3 1
2014-05-01 01:00:01 2 3
2014-05-01 01:00:02 3 1
2014-05-01 01:00:03 2 1
2014-05-01 01:00:04 1
Thanks in advance!
Upvotes: 3
Views: 1230
Reputation:
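import numpy as np
import pandas as pd

# upper_triangle[i, j] is True when row j comes after row i and has a smaller output
upper_triangle = np.triu(df.output.values < df.output.values[:, None])
df['datetime'] = pd.to_datetime(df['datetime'])
# argmax gives the position of the first True per row, so index by position with iloc
df['s_until_dec'] = df['datetime'].iloc[upper_triangle.argmax(axis=1)].values - df['datetime']
df.loc[~upper_triangle.any(axis=1), 's_until_dec'] = np.nan
df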
datetime output s_until_dec
0 2014-05-01 01:00:00 3 00:00:01
1 2014-05-01 01:00:01 2 00:00:03
2 2014-05-01 01:00:02 3 00:00:01
3 2014-05-01 01:00:03 2 00:00:01
4 2014-05-01 01:00:04 1 NaT
Here's how it works:
df.output.values < df.output.values[:, None] creates a pairwise comparison matrix with broadcasting ([:, None] creates a new axis):
df.output.values < df.output.values[:, None]
Out:
array([[False, True, False, True, True],
[False, False, False, False, True],
[False, True, False, True, True],
[False, False, False, False, True],
[False, False, False, False, False]], dtype=bool)
Here, for example, output[1] is smaller than output[0], so the matrix element for (0, 1) is True. We only need the upper triangle (the later rows), so I used np.triu to get the upper triangle of this matrix. argmax() gives the index of the first True value in each row. If I pass this into iloc, I get the corresponding date. The exception is the last row: it is all False, so argmax returns 0 and the result has to be replaced with np.nan. The .loc part checks the matrix for that case and replaces the value with np.nan.
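The steps above can be traced on the sample data with a short sketch. The intermediate names (less, later_and_less, first_pos, has_decrease) are mine and only there for illustration; the frame is rebuilt from the question's table so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.date_range("2014-05-01 01:00:00", periods=5, freq="s"),
    "output": [3, 2, 3, 2, 1],
})
values = df["output"].values

# less[i, j] is True when output[j] < output[i]
less = values < values[:, None]
# keep only pairs where j is not before i (the diagonal is always False anyway)
later_and_less = np.triu(less)
# argmax returns the position of the first True in each row (0 if there is none)
first_pos = later_and_less.argmax(axis=1)
# rows without any True have no later decrease
has_decrease = later_and_less.any(axis=1)

print(first_pos)      # [1 4 3 4 0]
print(has_decrease)   # [ True  True  True  True False]
```

The trailing 0 in first_pos is why the any(axis=1) check is needed: without it the last row would be treated as if the first row were its next decrease.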
Upvotes: 3
Reputation: 294258
Use numpy's outer subtract to get matrix of differences.
Then filter using numpy's triangle function to ensure we only take differences for future times and stay out of the past.
Use numpy's where to handle rows that are all False, i.e. rows with no later decrease, so they end up as NaN.
Finally, take the difference in time.
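import numpy as np
import pandas as pd

df = pd.DataFrame(
dict(output=[3, 2, 3, 2, 1],
datetime=pd.date_range(start='2014-05-01 01:00:00', periods=5, freq='s'))
)
# gt0[i, j] is True when row j is later than row i and has a smaller output
gt0 = np.triu(np.subtract.outer(df.output.values, df.output.values), 1) > 0
# position of the first later row with a smaller output (0 when there is none)
idx = gt0.argmax(axis=1)
secs = (df.datetime.values[idx] - df.datetime.values) / np.timedelta64(1, 's')
# rows that are all False have no later decrease, so they become NaN
pd.Series(np.where(gt0.any(axis=1), secs, np.nan), index=df.index, name='datetime')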
0 1.0
1 3.0
2 1.0
3 1.0
4 NaN
Name: datetime, dtype: float64
Mine and ayhan's seem to be the most performant on a small sample; ayhan's is the best over 10,000 rows.
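To check the timing claim on your own machine, a rough harness along these lines could be used. This is only a sketch: make_frame, triu_argmax and row_apply are hypothetical helper names wrapping two of the approaches in this thread, the random data is made up, and the numbers will vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n, seed=0):
    """Hypothetical helper: n rows of random integer output, one row per second."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "datetime": pd.date_range("2014-05-01 01:00:00", periods=n, freq="s"),
        "output": rng.integers(0, 10, size=n),
    })


def triu_argmax(df):
    """Broadcast comparison + upper triangle + argmax (needs an n x n matrix)."""
    v = df["output"].values
    later_and_less = np.triu(v < v[:, None])
    t = df["datetime"].values
    secs = (t[later_and_less.argmax(axis=1)] - t) / np.timedelta64(1, "s")
    # rows with no later decrease become NaN
    return pd.Series(np.where(later_and_less.any(axis=1), secs, np.nan), index=df.index)


def row_apply(df):
    """Row-wise apply: earliest later timestamp whose output is smaller."""
    def one(row):
        mask = (df["output"] < row["output"]) & (df["datetime"] > row["datetime"])
        later = df.loc[mask, "datetime"]
        return (later.min() - row["datetime"]).total_seconds() if len(later) else np.nan
    return df.apply(one, axis=1)


small = make_frame(200)
for fn in (triu_argmax, row_apply):
    seconds = timeit.timeit(lambda: fn(small), number=3)
    print(f"{fn.__name__}: {seconds:.4f}s for 3 runs on {len(small)} rows")
```

Changing the argument of make_frame lets you repeat the comparison at other sizes; keep in mind that the matrix-based version needs memory proportional to the square of the row count.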
Upvotes: 0
Reputation: 1329
Here's a one-liner:
df['seconds_until'] = df.apply(lambda x: pd.to_datetime(df.loc[(df['output'] < x['output']) & (df['datetime'] > x['datetime']), 'datetime'].min()) - pd.to_datetime(x['datetime']), axis=1)
output
datetime output seconds_until
0 2014/05/01 01:00:00 3 00:00:01
1 2014/05/01 01:00:01 2 00:00:03
2 2014/05/01 01:00:02 3 00:00:01
3 2014/05/01 01:00:03 2 00:00:01
4 2014/05/01 01:00:04 1 NaT
Upvotes: 1
Reputation: 19104
import numpy as np
import pandas as pd

df = pd.DataFrame([3, 2, 3, 2, 1], index=pd.date_range(start='2014-05-01 01:00:00', periods=5, freq='s'), columns=['output'])
def f(s):
    # s is one boolean row: True where another row's output is smaller than this row's
    s = s[s & (s.index > s.name)]  # keep only later timestamps
    if s.empty:
        return np.nan
    else:
        return (s.index[0] - s.name).total_seconds()
df['secondsuntildecrease'] = df['output'].apply(lambda x: df['output'] < x).apply(f, axis=1)
df
outputs
output secondsuntildecrease
2014-05-01 01:00:00 3 1.0
2014-05-01 01:00:01 2 3.0
2014-05-01 01:00:02 3 1.0
2014-05-01 01:00:03 2 1.0
2014-05-01 01:00:04 1 NaN
Upvotes: 2