Fed
Fed

Reputation: 93

Python pandas dataframes: filtering around key dates

I have a pandas date-frame df with the index being a daily DatetimeIndex, and an attached column with historical_sales.

If we wanted to filter days in the past where historical_sales were greater than a large number, say 200, easy enough:

df.loc[df['historical_sales'>200]]

I'm wondering however, what if we wanted to explore sales patterns on 5 days preceding AND following days when sales were > 200?

Many thanks.

Upvotes: 1

Views: 365

Answers (4)

jezrael
jezrael

Reputation: 862681

I think youn need get all index values by list comprehension and then select by loc.

Also is necessary use numpy.concatenate for join all indexes together with numpy.unique for remove duplicates.

np.random.seed(100)
rng = pd.date_range('2017-04-03', periods=20)
df = pd.DataFrame({'historical_sales': np.random.choice([100,200,300], size=20)}, index=rng)
print (df)
            historical_sales
2017-04-03               100
2017-04-04               100
2017-04-05               100
2017-04-06               300
2017-04-07               300
2017-04-08               100
2017-04-09               300
2017-04-10               200
2017-04-11               300
2017-04-12               300
2017-04-13               300
2017-04-14               300
2017-04-15               200
2017-04-16               100
2017-04-17               100
2017-04-18               100
2017-04-19               100
2017-04-20               300
2017-04-21               100
2017-04-22               200

idxmask = df.index[df['historical_sales']>200]
print (idxmask)
DatetimeIndex(['2017-04-06', '2017-04-07', '2017-04-09', '2017-04-11',
               '2017-04-12', '2017-04-13', '2017-04-14', '2017-04-20'],
              dtype='datetime64[ns]', freq=None)

#in real data change 1 to 5 for 5 days
temp_index = [df.loc[timestamp - pd.Timedelta(1, unit='d') : 
                     timestamp + pd.Timedelta(1, unit='d')].index for timestamp in idxmask]
idx = np.unique(np.concatenate(temp_index))

df1 = df.loc[idx]
print (df1)
            historical_sales
2017-04-05               100
2017-04-06               300
2017-04-07               300
2017-04-08               100
2017-04-09               300
2017-04-10               200
2017-04-11               300
2017-04-12               300
2017-04-13               300
2017-04-14               300
2017-04-15               200
2017-04-19               100
2017-04-20               300
2017-04-21               100

Upvotes: 1

Abhishek
Abhishek

Reputation: 3417

For the sake of clarity, I am adding a new column which is set to 1 for the row of interest. And for the easy verification the number of window dates have been kept to 1 instead of 5, in the code below

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

df = pd.DataFrame(data=np.random.rand(51),index=pd.date_range('2015-04-20','2015-06-09'),columns=['A'])
idx = df[df.A >0.5].index

df["new"] = 0

for date in idx:
    current_date = date.to_pydatetime()
    start = current_date - timedelta(days=1)
    end = current_date + timedelta(days=1)

    df.loc[start:current_date]["new"] = 1
    df.loc[current_date:end]["new"] = 1


print(df)

Upvotes: 0

Prakriti Gupta
Prakriti Gupta

Reputation: 111

When I need to work with rows before and after, I simply perform a shift.

df['preceeding_5th_day'] = df['historical_sales'].shift(5)
df['following_5th_day'] = df['historical_sales'].shift(-5)

Then, you can simply make your check and do

df.loc[df['historical_sales'>200]]

The selected rows then will also have columns about preceeding and following 5th day. Pretty straightforward this way.

Upvotes: 0

Christian Sauer
Christian Sauer

Reputation: 10899

You would want to do range slicing: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position

Should look like this (code is pseudocode):

great_sales_df = df.loc[df['historical_sales'>200]]
for sale in great_sales_df:
   sales_date = great_sales_df["date"]
   sales_before = sales_date + pd.DateOffset(-5)
   sales_after =  sales_date + pd.DateOffset(+5)
   pattern_df = df.iloc[sales_before:sales_after]

This code will not work, but I think the direction is right.

Upvotes: 0

Related Questions