Reputation: 1591
I have a pandas dataframe:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02 00:00:00,thing1,25,47
1,2004-01-17 00:00:00,thing2,150,8
2,2004-01-29 00:00:00,thing2,150,25
3,2017-07-15 00:00:00,thing3,55,17
3,2016-05-12 00:00:00,thing3,55,47
4,2012-02-23 00:00:00,thing2,150,22
4,2009-10-10 00:00:00,thing1,25,12
4,2014-04-04 00:00:00,thing2,150,2
5,2008-07-09 00:00:00,thing2,150,43
5,2004-01-30 00:00:00,thing1,25,40
5,2004-01-31 00:00:00,thing1,25,22
5,2004-02-01 00:00:00,thing1,25,2
And I have written the following to apply date range columns:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
window_start_date_idx = np.argmax(x < start_date_period.end_time)
return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
window_end_date_idx = np.argmin(x > end_date_period.start_time)
return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
However this is very slow, so I have been trying to vectorize it:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
window_start_date_idx = np.argmax(x < start_date_period.end_time)
return start_date_period[window_start_date_idx]
df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)
def find_window_end_date(x):
window_end_date_idx = np.argmin(x > end_date_period.start_time)
return end_date_period[window_end_date_idx]
df['window_end_dt'] = find_window_end_date(df['transaction_dt'].values)
However this produces a value error: "Lengths must match to compare". I am new to trying to write vectorized functions from scratch, so I would appreciate any insight into where I am going awry.
EDIT:
Here is the full error message:
ValueError Traceback (most recent call last)
<ipython-input-11-a781075467c5> in <module>()
5 return start_date_period[window_start_date_idx]
6
----> 7 df['window_start_dt'] = find_window_start_date(df['transaction_dt'].values)
8
9 def find_window_end_date(x):
<ipython-input-11-a781075467c5> in find_window_start_date(x)
2
3 def find_window_start_date(x):
----> 4 window_start_date_idx = np.argmax(x < start_date_period.end_time)
5 return start_date_period[window_start_date_idx]
6
C:\Users\AppData\Local\Continuum\anaconda2\lib\site-packages\pandas\core\ops.pyc in wrapper(self, other, axis)
826 if (not is_scalar(lib.item_from_zerodim(other)) and
827 len(self) != len(other)):
--> 828 raise ValueError('Lengths must match to compare')
829
830 if isinstance(other, ABCPeriodIndex):
ValueError: Lengths must match to compare
EDIT:
I ended up finding an edge condition with the original solution when there are collisions on the first/last days of a 30 day window. I have made some changes to get closer to a robust solution now:
start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')
tra = df['transaction_dt'].values[:, None]
idx1 = np.argmax(start_date_range.values < tra, axis=1)
idx2 = np.argmax(end_date_range.values > tra, axis=1)
df['window_start_dt'] = start_date_range[idx1]
df['window_end_dt'] = end_date_range[idx2]
However, this is still not working correctly because it only sets 'window_start_dt' to the lowest/first value in the date range: '2004-01-01'. Good news is it should be faster yet again.
EDIT:
I added an answer with the solution to the date collision issue below based on jezrael's answer
EDIT
Turns out there was still one more edge case. Please see jezrael's answer here for the final solution:Numpy: conditional np.where replace
Upvotes: 1
Views: 432
Reputation: 1591
I actually found a particularly clever fix for the date collision issue which is just a small change from @jezrael's fantastic answer:
It occurred to me that start_date_range and end_date_range have the exact same number of elements and that, in this specific case, you can just reuse the index created from end_date_range and apply it via np.take to start_date_range to get the correct intervals every time:
start_date_range = pd.date_range('2004-01-01 00:00:00', '12-31-2017 00:00:00', freq='30D')
end_date_range = pd.date_range('2004-01-30 23:59:59', '12-31-2017 23:59:59', freq='30D')
tra = df['transaction_dt'].values[:, None]
idx = np.argmax(end_date_range.values > tra, axis=1)
df['window_start_dt'] = np.take(start_date_range, idx)
df['window_end_dt'] = end_date_range[idx]
It is also even faster now!
Upvotes: 2
Reputation: 862681
I believe you can use broadcasting
:
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
tra = df['transaction_dt'].values[:, None]
idx1 = np.argmax(tra < start_date_period.end_time.values, axis=1)
idx2 = np.argmin(tra > end_date_period.start_time.values, axis=1)
df['window_start_dt'] = start_date_period[idx1]
df['window_end_dt'] = end_date_period[idx2]
print (df)
customer_id transaction_dt product price units window_start_dt \
0 1 2004-01-02 thing1 25 47 2004-01-01
1 1 2004-01-17 thing2 150 8 2004-01-01
2 2 2004-01-29 thing2 150 25 2004-01-01
3 3 2017-07-15 thing3 55 17 2017-06-21
4 3 2016-05-12 thing3 55 47 2016-04-27
5 4 2012-02-23 thing2 150 22 2012-02-18
6 4 2009-10-10 thing1 25 12 2009-10-01
7 4 2014-04-04 thing2 150 2 2014-03-09
8 5 2008-07-09 thing2 150 43 2008-07-08
window_end_dt
0 2004-01-30
1 2004-01-30
2 2004-01-30
3 2017-07-20
4 2016-05-26
5 2012-03-18
6 2009-10-30
7 2014-04-07
8 2008-08-06
Upvotes: 1
Reputation: 23753
I'll sidestep the error you are getting for mismatching lengths.
This still uses .apply
so it may not be what you want, but maybe simplifying it will help:
Make a column of pandas.Period's with a month frequency.
df['window'] = df.transaction_dt.apply(pd.Period, freq='M')
> df
=> transaction_dt product price units window
customer_id
1 2004-01-02 thing1 25 47 2004-01
1 2004-01-17 thing2 150 8 2004-01
2 2004-01-29 thing2 150 25 2004-01
3 2017-07-15 thing3 55 17 2017-07
3 2016-05-12 thing3 55 47 2016-05
4 2012-02-23 thing2 150 22 2012-02
4 2009-10-10 thing1 25 12 2009-10
4 2014-04-04 thing2 150 2 2014-04
5 2008-07-09 thing2 150 43 2008-07
Each Period has a start_time
and end_time
attribute which is easy to use
> df.window.dt.start_time
=> customer_id
1 2004-01-01
1 2004-01-01
2 2004-01-01
3 2017-07-01
3 2016-05-01
4 2012-02-01
4 2009-10-01
4 2014-04-01
5 2008-07-01
Name: window, dtype: datetime64[ns]
If you really need a start and end column;
df['start'] = df.window.dt.start_time
df['end'] = df.window.dt.end_time
> df.head()
=> transaction_dt product price units window start end
customer_id
1 2004-01-02 thing1 25 47 2004-01 2004-01-01 2004-01-31
1 2004-01-17 thing2 150 8 2004-01 2004-01-01 2004-01-31
2 2004-01-29 thing2 150 25 2004-01 2004-01-01 2004-01-31
3 2017-07-15 thing3 55 17 2017-07 2017-07-01 2017-07-31
3 2016-05-12 thing3 55 47 2016-05 2016-05-01 2016-05-31
Upvotes: 0