green diod

Reputation: 1499

pandas dataframe by boolean value, by index, and by integer

I have a similar problem to the one here (dataframe by index and by integer)

What I want is to select part of the DataFrame by boolean indexing (easy) and then look at a few rows before each match, say the previous index value and possibly a few more. Unfortunately, the get_loc approach suggested in the linked question makes my code choke (TypeError in the snippet below) before I can get the actual integer locations.

Taking the same example as in the answer in the other question, here is what I tried:

import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame(index=pd.date_range(start=dt.datetime(2015,1,1), end=dt.datetime(2015,2,1)), data={'a':np.arange(32)})
df.index.get_loc(df.index[df['a'] == 1])
*** TypeError: Cannot convert input to TimeStamp

The previous answer passed a string to get_loc, whereas I'd just like to pass a plain index value (here a datetime).

Upvotes: 1

Views: 3117

Answers (1)

unutbu

Reputation: 880717

Using a slice:

import numpy as np
import pandas as pd
import datetime as DT
index = pd.date_range(start=DT.datetime(2015,1,1), end=DT.datetime(2015,2,1))
df = pd.DataFrame({'a':np.arange(len(index))}, index=index)

mask = df['a'] == 10
idx = np.flatnonzero(mask)[0]
lookback = 3
print(df.iloc[max(idx-lookback, 0):idx+1])

yields

             a
2015-01-08   7
2015-01-09   8
2015-01-10   9
2015-01-11  10

Note that if idx-lookback is negative, the start of the slice is counted from the tail of df, just like with Python lists:

In [163]: df.iloc[-3:2]
Out[163]: 
Empty DataFrame
Columns: [a]
Index: []

In [164]: df.iloc[0:2]
Out[164]: 
            a
2015-01-01  0
2015-01-02  1

Thus, to grab elements relative to the head of df, use max(idx-lookback, 0).
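
On the error in the question: get_loc takes a single label, whereas df.index[df['a'] == 1] is a whole DatetimeIndex. The following is a minimal sketch, assuming a unique index as in this example, of turning labels into integer positions and collecting one clamped window per match. rows_before is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2015-01-01", end="2015-02-01")
df = pd.DataFrame({"a": np.arange(len(index))}, index=index)

mask = df["a"].isin([1, 10])
labels = df.index[mask]          # DatetimeIndex holding every matching label

# get_loc wants one label, so hand it a scalar Timestamp.
first_pos = df.index.get_loc(labels[0])

# get_indexer takes many labels and returns their integer positions.
positions = df.index.get_indexer(labels)


def rows_before(df, positions, lookback):
    """Hypothetical helper: one window per position, start clamped at 0."""
    return [df.iloc[max(p - lookback, 0):p + 1] for p in positions]


windows = rows_before(df, positions, lookback=3)
for w in windows:
    print(w, end="\n\n")
```

np.flatnonzero(mask) gives the same positions directly from the boolean mask; get_indexer is the route to take when you start from index labels instead.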


Using a boolean mask:

As you know, if you have a boolean array or boolean Series such as

mask = df['a'] == 10

you can select the corresponding rows with

df.loc[mask]

If you wish to select previous or succeeding rows shifted by a fixed amount, you could use mask.shift to shift the mask:

df.loc[mask.shift(-lookback, fill_value=False)]
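
As a small illustration of the direction of the shift, here is a sketch on the same example frame. It assumes a pandas version whose Series.shift accepts fill_value, which keeps the shifted mask boolean. The names earlier and later are just illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range(start="2015-01-01", end="2015-02-01")
df = pd.DataFrame({"a": np.arange(len(index))}, index=index)

mask = df["a"] == 10
lookback = 3

# A negative shift moves each True toward earlier rows,
# a positive shift moves it toward later rows.
earlier = mask.shift(-lookback, fill_value=False)
later = mask.shift(lookback, fill_value=False)

print(df.loc[earlier])   # the single row 3 positions before the match
print(df.loc[later])     # the single row 3 positions after the match
```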

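If you wish to select each matching row together with the lookback preceding rows, then you could expand the mask by taking its union with shifted copies of the original mask:

lookback = 3
hit = mask.copy()  # keep the original matches; shifting the growing mask would over-expand it
for i in range(1, lookback + 1):
    mask |= hit.shift(-i, fill_value=False)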

or, equivalently, use cumsum:

m = mask.astype(int)  # integers, so the subtraction and cumsum are well defined
mask = (m.shift(-lookback) - m.shift(1)).cumsum().fillna(0).astype(bool)

The for-loop is clearer, but the cumsum expression is faster, particularly if lookback is large.
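
One caveat with the cumsum form: if a match lies fewer than lookback rows from the start, the +1 marker is shifted off the front of the Series and the running sum no longer brackets the window. A minimal sketch of a variant that clamps at the start, using the integer positions of the matches, might look like this. expand_back is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd


def expand_back(mask, lookback):
    """Hypothetical helper: mark each True row and up to `lookback` rows before it."""
    out = np.zeros(len(mask), dtype=bool)
    for p in np.flatnonzero(mask.to_numpy()):
        # Clamp the start at 0 so the slice never wraps around to the tail.
        out[max(p - lookback, 0):p + 1] = True
    return pd.Series(out, index=mask.index)


index = pd.date_range(start="2015-01-01", end="2015-02-01")
df = pd.DataFrame({"a": np.arange(len(index))}, index=index)

mid = df.loc[expand_back(df["a"] == 10, lookback=3)]
edge = df.loc[expand_back(df["a"] == 1, lookback=3)]
print(mid)
print(edge)
```

The loop here runs once per match rather than once per shift, so it stays cheap when matches are sparse, even for a large lookback.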


For example,

import numpy as np
import pandas as pd
import datetime as DT
df = pd.DataFrame(
    index=pd.date_range(start=DT.datetime(2015,1,1), end=DT.datetime(2015,2,1)), 
    data={'a':np.arange(32)})

mask = df['a'] == 10
lookback = 3
hit = mask.copy()  # original matches, left unshifted
for i in range(1, lookback + 1):
    mask |= hit.shift(-i, fill_value=False)

# alternatively,
# m = hit.astype(int)
# mask = (m.shift(-lookback) - m.shift(1)).cumsum().fillna(0).astype(bool)

print(df.loc[mask])

yields

             a
2015-01-08   7
2015-01-09   8
2015-01-10   9
2015-01-11  10

Upvotes: 2
