Reputation: 1960

Pandas Cumulative Time Series Range in Data Frame

I'm looking to have an "expanding" date range based on the values in a starttime and endcolumn.

If any part of a record occurs in a prior record, I want to return a starttime that is the minimum of the two starttime records and an endtime that is the maximum of the two endtime records.

These will be grouped by an order id

Order starttime             endtime                 RollingStart            RollingEnd
1   2015-07-01 10:24:43.047 2015-07-01 10:24:43.150 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150
1   2015-07-01 10:24:43.137 2015-07-01 10:24:43.200 2015-07-01 10:24:43.047 2015-07-01 10:24:43.200
1   2015-07-01 10:24:43.197 2015-07-01 10:24:57.257 2015-07-01 10:24:43.047 2015-07-01 10:24:57.257
1   2015-07-01 10:24:57.465 2015-07-01 10:25:13.470 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470
1   2015-07-01 10:24:57.730 2015-07-01 10:25:13.485 2015-07-01 10:24:57.465 2015-07-01 10:25:13.485
2   2015-07-01 10:48:57.465 2015-07-01 10:48:13.485 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485

So, in the above example, Order 1 has an initial range that runs from 2015-07-01 10:24:43.047 to 2015-07-01 10:24:57.257 and then another one from 2015-07-01 10:24:57.465 to 2015-07-01 10:25:13.485

Note that while the starttimes are in order, the endtimes are not necessarily due to the nature of the data (there are short term events and long term events)

In the end, I only want the last record of each orderid,rolling start combination (so in this case, the last two records

I tried

df['RollingStart'] = np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift()>= df['starttime']), min(df['starttime'],df['RollingStart']),df['starttime'])

(this obviously doesn't include the order id)

But the error I receive is

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any ideas would be much appreciated

Code to replicate follows:

from io import StringIO
import io

text = """Order   starttime               endtime
1       2015-07-01 10:24:43.047  2015-07-01 10:24:43.150
1       2015-07-01 10:24:43.137  2015-07-01 10:24:43.200
1       2015-07-01 10:24:43.197  2015-07-01 10:24:57.257
1       2015-07-01 10:24:57.465  2015-07-01 10:25:13.470
1       2015-07-01 10:24:57.730  2015-07-01 10:25:13.485
2       2015-07-01 10:48:57.465  2015-07-01 10:48:13.485"""

df = pd.read_csv(StringIO(text), sep='\s{2,}', engine='python', parse_dates=[1, 2])
df['RollingStart'] = np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift()>= df['start']), min(df['starttime'],df['RollingStart']),df['starttime'])




df = pd.read_csv(StringIO(text), sep='\s{2,}', engine='python', parse_dates=[1, 2])


df['RollingStart']=df['starttime']
df['RollingEnd']=df['endtime']
df['RollingStart'] = 
np.where((df['endtime'] >= df['RollingStart'].shift()) & (df['RollingEnd'].shift()>= df['starttime']),min(df['starttime'],df['RollingStart']),df['starttime'])

Error is:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 731, in     __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks

Upvotes: 0

Answers (2)

Merlin

Reputation: 25699

Try this:

Version 1

NaT = pd.NaT
df['Rolling2']     = np.where(df['starttime'].shift(-1) > df['endtime'], NaT,'drop')
df['Rolling2']     = df['Rolling2'].shift(1)
df['RollingStart'] = np.where(df['Rolling2']  =='drop',None,df['starttime'])
df['RollingStart'] = pd.to_datetime(df['RollingStart']).ffill()
df['RollingEnd']   = df['endtime']
del df['Rolling2']

Version 2.

df['RollingStart'] = df['starttime']
df['RollingEnd']   = df['endtime']
df['RollingStart'] = np.where(df['RollingEnd'].shift()>= df['starttime'] ,pd.NaT , df['RollingStart'])
df['RollingStart'] = pd.to_datetime(df['RollingStart']).ffill()


  Order               starttime                 endtime            RollingStart              RollingEnd
0      1 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150 2015-07-01 10:24:43.047 2015-07-01 10:24:43.150
1      1 2015-07-01 10:24:43.137 2015-07-01 10:24:43.200 2015-07-01 10:24:43.047 2015-07-01 10:24:43.200
2      1 2015-07-01 10:24:43.197 2015-07-01 10:24:57.257 2015-07-01 10:24:43.047 2015-07-01 10:24:57.257
3      1 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470 2015-07-01 10:24:57.465 2015-07-01 10:25:13.470
4      1 2015-07-01 10:24:57.730 2015-07-01 10:25:13.485 2015-07-01 10:24:57.465 2015-07-01 10:25:13.485
5      2 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485 2015-07-01 10:48:57.465 2015-07-01 10:48:13.485

Upvotes: 0

Vincent Buscarello

Reputation: 435

It looks like you are trying to return a value based on a value that isn't set yet,

df['start'] =...conditions... df['start'].shift()

looks to me like you are trying to set a condition on a column that Pandas doesn't know anything about.

If you are just trying to set the "start" value as the latest time in these columns, try building a statement with or statements, or a make a temporary array and use max, if you are just trying to get the latest time

df['start'] = np.where(max(df['enddatetime'],df['startdatetime'],))

If the above is way off, do you have the code to reproduce this df so I can see if I get the same error?

Upvotes: 1

Pandas Cumulative Time Series Range in Data Frame

Answers (2)

Related Questions