Reputation: 5552
I need to slice several intervals out of one dataframe indexed with Freq: 120T. The start date of each of the desired intervals is given by a second dataframe indexed with Freq: None. The idea is that I need to take each of those start dates and include a time and # of periods to build each interval. The time attribute and the periods are the same for all intervals.
Let's look at this mess with an example:
Say our start time is '18:00:00' and our # of periods is 3. The dataframe to be sliced is df1 and the dataframe that contains the start dates is df2.
df1
                     A  B
DateTime
2005-09-06 16:00:00  1  5
2005-09-06 18:00:00  2  6
2005-09-06 20:00:00  3  7
2005-09-06 22:00:00  4  8
2005-12-07 16:00:00  9  8
2005-12-07 18:00:00  7  6
2005-12-07 20:00:00  5  4
2005-12-07 22:00:00  3  2
<class 'pandas.tseries.index.DatetimeIndex'>
[2005-09-06 16:00:00, ..., 2005-12-07 22:00:00]
Length: 8, Freq: 120T, Timezone: None
df2
            Num
DateTime
2005-09-07    1
2005-12-07    2
<class 'pandas.tseries.index.DatetimeIndex'>
[2005-09-07, 2005-12-07]
Length: 2, Freq: None, Timezone: None
Desired Output:
df3 = func(source=df1['B'], start_dates=df2.index, time_start='18:00:00', periods=3)
          1  2
18:00:00  6  6
20:00:00  7  4
22:00:00  8  2
What have I done and considerations:
One of the difficulties is that the data in df1 is in 120T frequency but is only business days. Taking this into account, I would do something like this:
start = df2.index[0]  ## And somehow add to this formula the fact that we want to start at '18:00'
df3 = df1['B'][(df1.index > start) & (df1.index < start + 3)]  ## Somehow iterate this over the dates in the df2 index
I appreciate any insight
Thanks in advance
Upvotes: 2
Views: 2622
Reputation: 375535
A completely different approach:
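import numpy as np
import pandas as pd

def next_n_asof(x, t, n):
    """The next n rows of x at or after time t."""
    # argmax on a boolean array gives the position of the first True,
    # i.e. the first index entry >= t. Use x's own index, not a global df.
    i = np.argmax(x.index >= t)
    return x.iloc[i:i + n]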
In [11]: next_n_asof(df.B, pd.Timestamp('2005-09-06 18:00:00'), 3)
Out[11]:
2005-09-06 18:00:00 6
2005-09-06 20:00:00 7
2005-09-06 22:00:00 8
Name: B, dtype: int64
We can use this in a concat for each day in the index:
In [12]: pd.concat(next_n_asof(df.B, t, 3)
                   for t in df2.index + pd.to_timedelta(18, unit='h'))
Out[12]:
2005-09-06 18:00:00 6
2005-09-06 20:00:00 7
2005-09-06 22:00:00 8
2005-12-07 18:00:00 6
2005-12-07 20:00:00 4
2005-12-07 22:00:00 2
Name: B, dtype: int64
Note that we had to add the time to the dates in df2.index:
In [13]: df2.index + pd.to_timedelta(18, unit='h')
Out[13]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2005-09-06 18:00:00, 2005-12-07 18:00:00]
Length: 2, Freq: None, Timezone: None
Note: I couldn't get this to work cleanly with asof itself... which may be more efficient.
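On the efficiency point, one possible alternative is a binary search with searchsorted, which avoids scanning the whole index as long as the index is sorted. This is only a sketch: next_n_from is a made-up name, the data is rebuilt by hand from the question, and the start dates are assumed to be 2005-09-06 and 2005-12-07 (matching the outputs above rather than the 2005-09-07 shown in the question).

```python
import pandas as pd

# Column B of df1 from the question, rebuilt by hand.
idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
b = pd.Series([5, 6, 7, 8, 8, 6, 4, 2], index=idx, name="B")

def next_n_from(x, t, n):
    """First n rows of x at or after time t (hypothetical helper; x.index must be sorted)."""
    i = x.index.searchsorted(t, side="left")  # binary search for the first entry >= t
    return x.iloc[i:i + n]

# ASSUMPTION: start dates chosen to match the outputs shown in this answer.
start_dates = pd.DatetimeIndex(["2005-09-06", "2005-12-07"])
starts = start_dates + pd.Timedelta(hours=18)

result = pd.concat([next_n_from(b, t, 3) for t in starts])
print(result)
```

One difference from the argmax version: if t is later than every entry in the index, searchsorted returns len(x) and the slice is empty, whereas argmax on an all-False mask returns 0 and would hand back the first n rows.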
Upvotes: 1
Reputation: 375535
You're looking for pivot... at least once you've extracted the rows you're interested in.
To select the rows that fall on the same dates, use normalize and isin:
In [11]: res = df.loc[df.index.normalize().isin(df2.index), 'B']
In [12]: res
Out[12]:
2005-09-06 16:00:00 5
2005-09-06 18:00:00 6
2005-09-06 20:00:00 7
2005-09-06 22:00:00 8
2005-12-07 16:00:00 8
2005-12-07 18:00:00 6
2005-12-07 20:00:00 4
2005-12-07 22:00:00 2
Name: B, dtype: int64
Once it's in that form pivot away (if there is likely to be missing data you may have to use pivot_table, which is a bit more flexible)!
In [14]: pd.pivot(res.index.time, res.index.normalize(), res.values)
Out[14]:
          2005-09-06  2005-12-07
16:00:00           5           8
18:00:00           6           6
20:00:00           7           4
22:00:00           8           2
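The three-positional-argument pd.pivot(index, columns, values) call above follows the pandas signature of the time. As far as I know, recent pandas versions expect a DataFrame plus keyword arguments instead, so a rough equivalent might look like the sketch below. The column names "time", "date" and "B" in the intermediate frame are my own choice, and res is rebuilt by hand from the output shown earlier.

```python
import pandas as pd

# res as shown in Out[12]: column B for the two dates of interest.
idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
res = pd.Series([5, 6, 7, 8, 8, 6, 4, 2], index=idx, name="B")

# Put the pivot keys in ordinary columns so DataFrame.pivot can refer to them by name.
frame = pd.DataFrame({
    "time": res.index.time,         # time of day, becomes the row labels
    "date": res.index.normalize(),  # midnight-normalized date, becomes the columns
    "B": res.to_numpy(),
})
pv = frame.pivot(index="time", columns="date", values="B")
print(pv)
```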
The "meat" of the row selection is the isin call, which checks whether each timestamp, normalized to midnight, is contained in df2.index:
df.index.normalize().isin(df2.index)
If we also care about a time we can use indexer_between_time:
In [15]: df['B'].iloc[df.index.indexer_between_time('18:00', '00:00')]
Out[15]:
2005-09-06 18:00:00 6
2005-09-06 20:00:00 7
2005-09-06 22:00:00 8
2005-12-07 18:00:00 6
2005-12-07 20:00:00 4
2005-12-07 22:00:00 2
Name: B, dtype: int64
Ok, in this example these are the same (as the frame only contains the dates we want!), but in general you really want both of these conditions (to "and" them)...
# I had tried to make this a one-liner but utterly failed!
in_time = np.zeros(len(df), dtype=bool)
in_time[df.index.indexer_between_time('18:00', '00:00')] = True
res = df.loc[df.index.normalize().isin(df2.index) & in_time, 'B']
In [17]: res
Out[17]:
2005-09-06 18:00:00 6
2005-09-06 20:00:00 7
2005-09-06 22:00:00 8
2005-12-07 18:00:00 6
2005-12-07 20:00:00 4
2005-12-07 22:00:00 2
Name: B, dtype: int64
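For what it's worth, the np.zeros mask can probably be collapsed into a single expression with np.isin, by asking which row positions appear in the array returned by indexer_between_time. A minimal sketch, with df rebuilt by hand from the question and the wanted dates assumed to be 2005-09-06 and 2005-12-07 (the variable names positions, on_date and wanted are mine):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
df = pd.DataFrame({"A": [1, 2, 3, 4, 9, 7, 5, 3],
                   "B": [5, 6, 7, 8, 8, 6, 4, 2]}, index=idx)

# ASSUMPTION: stand-in for df2.index, with dates matching the outputs in this answer.
wanted = pd.DatetimeIndex(["2005-09-06", "2005-12-07"])

# Integer positions of rows whose time of day is 18:00 or later (wrapping to midnight).
positions = df.index.indexer_between_time("18:00", "00:00")

# Turn the positions into a boolean mask in one step.
in_time = np.isin(np.arange(len(df)), positions)
on_date = df.index.normalize().isin(wanted)

res = df.loc[on_date & in_time, "B"]
print(res)
```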
You can map the columns of the pivoted result:
In [21]: pv = pd.pivot(res.index.time, res.index.normalize(), res.values)
In [22]: pv
Out[22]:
          2005-09-06  2005-12-07
18:00:00           6           6
20:00:00           7           4
22:00:00           8           2
In [23]: pv.columns = pv.columns.map(df2.Num.get)
In [24]: pv
Out[24]:
          1  2
18:00:00  6  6
20:00:00  7  4
22:00:00  8  2
voila.
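Putting the pieces together, something like the following could stand in for the func(source, start_dates, time_start, periods) call from the question. Treat it as a sketch under a few assumptions: the body is my own arrangement of the isin / time filter / pivot steps, it uses between_time and groupby(...).head(periods) to honour the periods argument, and it only looks at rows up to 23:59:59 on each date, so a window never runs past midnight. The start dates are again assumed to be 2005-09-06 and 2005-12-07.

```python
import pandas as pd

def func(source, start_dates, time_start, periods):
    """Hypothetical wrapper: one column per start date, rows labelled by time of day."""
    # Keep only rows whose date is one of the start dates.
    picked = source[source.index.normalize().isin(start_dates)]
    # Keep rows from time_start to the end of the day (does not cross midnight).
    picked = picked.between_time(time_start, "23:59:59")
    # At most `periods` rows per date, in index order.
    picked = picked.groupby(picked.index.normalize()).head(periods)
    frame = pd.DataFrame({
        "time": picked.index.time,
        "date": picked.index.normalize(),
        "value": picked.to_numpy(),
    })
    return frame.pivot(index="time", columns="date", values="value")

idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
b = pd.Series([5, 6, 7, 8, 8, 6, 4, 2], index=idx, name="B")

# ASSUMPTION: stand-in for df2.index.
start_dates = pd.DatetimeIndex(["2005-09-06", "2005-12-07"])

df3 = func(b, start_dates, "18:00:00", 3)
print(df3)
```

The columns come back as dates; they can be relabelled with the Num values the same way as above, via columns.map(df2.Num.get).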
Upvotes: 2