hernanavella

Reputation: 5552

Advanced Slicing of Intervals in Pandas Dataframe

I need to slice several intervals out of one dataframe whose index has Freq: 120T. The start date of each interval comes from the index of a second dataframe (Freq: None). Each interval is built from one of those start dates plus a start time and a number of periods. The start time and the number of periods are the same for all intervals.

Let's look at this mess with an example.

Say our start time is '18:00:00' and our # of periods is 3. The dataframe to be sliced is df1 and the dataframe that contains the start dates is df2.

df1

                      A   B     
DateTime                                               
2005-09-06 16:00:00   1   5  
2005-09-06 18:00:00   2   6  
2005-09-06 20:00:00   3   7  
2005-09-06 22:00:00   4   8 
2005-12-07 16:00:00   9   8  
2005-12-07 18:00:00   7   6  
2005-12-07 20:00:00   5   4  
2005-12-07 22:00:00   3   2  

<class 'pandas.tseries.index.DatetimeIndex'>
[2005-09-06 16:00:00, ..., 2005-12-07 22:00:00]
Length: 8, Freq: 120T, Timezone: None

df2

             Num
DateTime                                                                    
2005-09-06     1
2005-12-07     2

<class 'pandas.tseries.index.DatetimeIndex'>
[2005-09-06, 2005-12-07]
Length: 2, Freq: None, Timezone: None

Desired Output:

df3 = func(source=df1['B'], start_dates=df2.index, time_start='18:00:00', periods=3)

           1   2
18:00:00   6   6  
20:00:00   7   4  
22:00:00   8   2 

What I have tried, and considerations:

One of the difficulties is that the data in df1 has a 120T frequency but only covers business days. Taking this into account, I would do something like this:

start = df2.index[0]   ##  And somehow add to this formula the fact that we want to start at '18:00'
df3 = df1['B'][(df1.index > start) & (df1.index < start + 3)]  ##  Somehow iterate this over the dates in the df2 index

I'd appreciate any insight.

Thanks in advance.

Upvotes: 2

Views: 2622

Answers (2)

Andy Hayden

Reputation: 375535

A completely different approach (df below is the question's df1):

def next_n_asof(x, t, n):
    """The first n rows of x at or after time t
    """
    i = np.argmax(x.index >= t)  # position of the first index entry >= t
    return x[i:i + n]

In [11]: next_n_asof(df.B, pd.Timestamp('2005-09-06 18:00:00'), 3)
Out[11]:
2005-09-06 18:00:00    6
2005-09-06 20:00:00    7
2005-09-06 22:00:00    8
Name: B, dtype: int64

We can use this in a concat for each day in the index:

In [12]: pd.concat(next_n_asof(df.B, t, 3)
                   for t in df2.index + pd.to_timedelta(18, unit='h'))
Out[12]:
2005-09-06 18:00:00    6
2005-09-06 20:00:00    7
2005-09-06 22:00:00    8
2005-12-07 18:00:00    6
2005-12-07 20:00:00    4
2005-12-07 22:00:00    2
Name: B, dtype: int64

Note that we had to add the time to the dates in df2.index:

In [13]: df2.index + pd.to_timedelta(18, unit='h')
Out[13]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2005-09-06 18:00:00, 2005-12-07 18:00:00]
Length: 2, Freq: None, Timezone: None

Note: I couldn't get this to work cleanly with asof itself... which may be more efficient.
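
One possible way to avoid scanning the whole index is a binary search with searchsorted, which works as long as the index is sorted. This is only a minimal sketch, not a replacement for asof: the helper name next_n_from is made up for illustration, and the sample frames are rebuilt from the question using the dates shown in the output above.

```python
import pandas as pd

def next_n_from(x, t, n):
    """First n rows of x at or after time t (hypothetical helper).

    Assumes x.index is a sorted DatetimeIndex.
    """
    i = x.index.searchsorted(t, side="left")  # first position with index >= t
    return x.iloc[i:i + n]

# Sample data rebuilt from the question
idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 9, 7, 5, 3],
                    "B": [5, 6, 7, 8, 8, 6, 4, 2]}, index=idx)
df2 = pd.DataFrame({"Num": [1, 2]},
                   index=pd.to_datetime(["2005-09-06", "2005-12-07"]))

starts = df2.index + pd.Timedelta(hours=18)  # shift each date to 18:00
out = pd.concat([next_n_from(df1["B"], t, 3) for t in starts])
print(out)
```

Because the slice is positional, the n rows may run past midnight into the next available business day if fewer than n rows remain on the start date.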

Upvotes: 1

Andy Hayden

Reputation: 375535

You're looking for pivot... at least once you've extracted the rows you're interested in.

To select the rows that fall on the dates in df2.index, use normalize and isin (df below is the question's df1):

In [11]: res = df.loc[df.index.normalize().isin(df2.index), 'B']

In [12]: res
Out[12]:
2005-09-06 16:00:00    5
2005-09-06 18:00:00    6
2005-09-06 20:00:00    7
2005-09-06 22:00:00    8
2005-12-07 16:00:00    8
2005-12-07 18:00:00    6
2005-12-07 20:00:00    4
2005-12-07 22:00:00    2
Name: B, dtype: int64

Once it's in that form, pivot away! (If there is likely to be missing data you may have to use pivot_table, which is a bit more flexible.)

In [14]: pd.pivot(res.index.time, res.index.normalize(), res.values)
Out[14]:
          2005-09-06  2005-12-07
16:00:00           5           8
18:00:00           6           6
20:00:00           7           4
22:00:00           8           2
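
In recent pandas versions the top-level pd.pivot expects a DataFrame as its first argument, so the three-array call above may not run as written. A rough equivalent, assuming res is the Series selected above, is to build a small long-format frame first and call DataFrame.pivot on it. The column names time, date and B are just illustrative choices.

```python
import pandas as pd

# res as selected above: column B for the dates of interest
idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
res = pd.Series([5, 6, 7, 8, 8, 6, 4, 2], index=idx, name="B")

# One row per observation: time of day, calendar date, value
long = pd.DataFrame({
    "time": res.index.time,
    "date": res.index.normalize(),
    "B": res.to_numpy(),
})

# Times become the rows, dates become the columns
wide = long.pivot(index="time", columns="date", values="B")
print(wide)
```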

The "meat" of the row selection is isin, which checks whether each timestamp, normalized to midnight, is contained in df2.index:

df.index.normalize().isin(df2.index)
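
To see what that mask looks like in isolation, here is a tiny sketch on a made-up index. The 2005-10-03 entry is an assumption added only to show a date that is not wanted.

```python
import pandas as pd

idx = pd.to_datetime([
    "2005-09-06 16:00",
    "2005-09-06 18:00",
    "2005-10-03 18:00",  # hypothetical extra date, not in the wanted list
    "2005-12-07 18:00",
])
wanted = pd.to_datetime(["2005-09-06", "2005-12-07"])

days = idx.normalize()    # every timestamp moved back to midnight
mask = days.isin(wanted)  # boolean array, one entry per row
print(days)
print(mask)
```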

If we also care about the time of day, we can use indexer_between_time, which returns integer positions (hence iloc):

In [15]: df['B'].iloc[df.index.indexer_between_time('18:00', '00:00')]
Out[15]:
2005-09-06 18:00:00    6
2005-09-06 20:00:00    7
2005-09-06 22:00:00    8
2005-12-07 18:00:00    6
2005-12-07 20:00:00    4
2005-12-07 22:00:00    2
Name: B, dtype: int64

OK, in this example the date condition makes no difference (as df contains only the dates we want!), but in general you really want both of these conditions (to "and" them)...

# I had tried to make this a one-liner but utterly failed!
in_time = np.zeros(len(df), dtype=bool)
in_time[df.index.indexer_between_time('18:00', '00:00')] = True
res = df.loc[df.index.normalize().isin(df2.index) & in_time, 'B']

In [17]: res
Out[17]:
2005-09-06 18:00:00    6
2005-09-06 20:00:00    7
2005-09-06 22:00:00    8
2005-12-07 18:00:00    6
2005-12-07 20:00:00    4
2005-12-07 22:00:00    2
Name: B, dtype: int64
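
A shorter way to "and" the two conditions might be to filter by time first with between_time and then apply the date mask to what is left. This is only a sketch: the frames are rebuilt from the question, and the 2005-10-03 row is a made-up addition so that the date condition actually removes something.

```python
import pandas as pd

idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-10-03 18:00",  # hypothetical row on a date that is not in df2
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 0, 9, 7, 5, 3],
                    "B": [5, 6, 7, 8, 0, 8, 6, 4, 2]}, index=idx)
df2 = pd.DataFrame({"Num": [1, 2]},
                   index=pd.to_datetime(["2005-09-06", "2005-12-07"]))

# Time condition: keep rows from 18:00 through midnight
late = df1.between_time("18:00", "00:00")

# Date condition: keep rows whose calendar date is in df2.index
res = late.loc[late.index.normalize().isin(df2.index), "B"]
print(res)
```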

You can map the columns of the pivoted result:

In [21]: pv = pd.pivot(res.index.time, res.index.normalize(), res.values)

In [22]: pv
Out[22]:
          2005-09-06  2005-12-07
18:00:00           6           6
20:00:00           7           4
22:00:00           8           2

In [23]: pv.columns = pv.columns.map(df2.Num.get)

In [24]: pv
Out[24]:
          1  2
18:00:00  6  6
20:00:00  7  4
22:00:00  8  2

Voilà.
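
For completeness, here is one way the pieces could be wrapped into the func signature from the question. Treat it as a sketch under assumptions rather than a definitive implementation: the columns are simply numbered 1, 2, ... in the order of start_dates (which matches df2.Num in the example), the source index is assumed to be sorted, and each interval is taken as the next `periods` rows at or after the start time.

```python
import pandas as pd

def func(source, start_dates, time_start, periods):
    """Sketch: one column per start date, rows labelled by time of day.

    Assumes source has a sorted DatetimeIndex. Columns are numbered from 1
    in the order of start_dates (an assumption, not taken from df2.Num).
    """
    offset = pd.to_timedelta(time_start)  # '18:00:00' -> 18 hours
    columns = {}
    for k, day in enumerate(start_dates, start=1):
        i = source.index.searchsorted(day + offset)  # first row at/after start
        chunk = source.iloc[i:i + periods]
        columns[k] = pd.Series(chunk.to_numpy(), index=chunk.index.time)
    return pd.DataFrame(columns)

# Sample data rebuilt from the question
idx = pd.to_datetime([
    "2005-09-06 16:00", "2005-09-06 18:00", "2005-09-06 20:00", "2005-09-06 22:00",
    "2005-12-07 16:00", "2005-12-07 18:00", "2005-12-07 20:00", "2005-12-07 22:00",
])
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 9, 7, 5, 3],
                    "B": [5, 6, 7, 8, 8, 6, 4, 2]}, index=idx)
df2 = pd.DataFrame({"Num": [1, 2]},
                   index=pd.to_datetime(["2005-09-06", "2005-12-07"]))

df3 = func(source=df1["B"], start_dates=df2.index,
           time_start="18:00:00", periods=3)
print(df3)
```

If an interval runs past the last row of its start date, the extra rows come from the next available day and are still labelled by time of day only, so whether that is acceptable depends on the data.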

Upvotes: 2
