Reputation: 1960
I'm looking to have an "expanding" date range based on the values in a starttime and endcolumn.
If any part of a record occurs in a prior record, I want to return a starttime that is the minimum of the two starttime records and an endtime that is the maximum of the two endtime records.
These will be grouped by an order id
Order starttime endtime RollingStart RollingEnd
1 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150
1 2015-07-01 10:24:43.137 2015-07-01 10:24:43.200 2015-07-01 10:24:43.047 2015-07-01 10:24:43.200
1 2015-07-01 10:24:43.197 2015-07-01 10:24:57.257 2015-07-01 10:24:43.047 2015-07-01 10:24:57.257
1 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470
1 2015-07-01 10:24:57.730 2015-07-01 10:25:13.485 2015-07-01 10:24:57.465 2015-07-01 10:25:13.485
2 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485
So, in the above example, Order 1 has an initial range that runs from 2015-07-01 10:24:43.047 to 2015-07-01 10:24:57.257 and then another one from 2015-07-01 10:24:57.465 to 2015-07-01 10:25:13.485
Note that while the starttimes are in order, the endtimes are not necessarily due to the nature of the data (there are short term events and long term events)
In the end, I only want the last record of each orderid,rolling start combination (so in this case, the last two records
I tried
df['RollingStart'] = np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift()>= df['starttime']), min(df['starttime'],df['RollingStart']),df['starttime'])
(this obviously doesn't include the order id)
But the error I receive is
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any ideas would be much appreciated
Code to replicate follows:
from io import StringIO
import io
text = """Order starttime endtime
1 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150
1 2015-07-01 10:24:43.137 2015-07-01 10:24:43.200
1 2015-07-01 10:24:43.197 2015-07-01 10:24:57.257
1 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470
1 2015-07-01 10:24:57.730 2015-07-01 10:25:13.485
2 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485"""
df = pd.read_csv(StringIO(text), sep='\s{2,}', engine='python', parse_dates=[1, 2])
df['RollingStart'] = np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift()>= df['start']), min(df['starttime'],df['RollingStart']),df['starttime'])
df = pd.read_csv(StringIO(text), sep='\s{2,}', engine='python', parse_dates=[1, 2])
df['RollingStart']=df['starttime']
df['RollingEnd']=df['endtime']
df['RollingStart'] =
np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift()>= df['starttime']),min(df['starttime'],df['RollingStart']),df['starttime'])
Error is:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 731, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Thanks
Upvotes: 0
Views: 377
Reputation: 25629
Try this:
Version 1
NaT = pd.NaT
df['Rolling2'] = np.where(df['starttime'].shift(-1) > df['endtime'], NaT,'drop')
df['Rolling2'] = df['Rolling2'].shift(1)
df['RollingStart'] = np.where(df['Rolling2'] =='drop',None,df['starttime'])
df['RollingStart'] = pd.to_datetime(df['RollingStart']).ffill()
df['RollingEnd'] = df['endtime']
del df['Rolling2']
Version 2.
df['RollingStart'] = df['starttime']
df['RollingEnd'] = df['endtime']
df['RollingStart'] = np.where(df['RollingEnd'].shift()>= df['starttime'] ,pd.NaT , df['RollingStart'])
df['RollingStart'] = pd.to_datetime(df['RollingStart']).ffill()
Order starttime endtime RollingStart RollingEnd
0 1 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150
1 1 2015-07-01 10:24:43.137 2015-07-01 10:24:43.200 2015-07-01 10:24:43.047 2015-07-01 10:24:43.200
2 1 2015-07-01 10:24:43.197 2015-07-01 10:24:57.257 2015-07-01 10:24:43.047 2015-07-01 10:24:57.257
3 1 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470
4 1 2015-07-01 10:24:57.730 2015-07-01 10:25:13.485 2015-07-01 10:24:57.465 2015-07-01 10:25:13.485
5 2 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485
Upvotes: 0
Reputation: 435
It looks like you are trying to return a value based on a value that isn't set yet,
df['start'] =...conditions... df['start'].shift()
looks to me like you are trying to set a condition on a column that Pandas doesn't know anything about.
If you are just trying to set the "start" value as the latest time in these columns, try building a statement with or statements, or a make a temporary array and use max, if you are just trying to get the latest time
df['start'] = np.where(max(df['enddatetime'],df['startdatetime'],))
If the above is way off, do you have the code to reproduce this df so I can see if I get the same error?
Upvotes: 1