Reputation: 1591
I have a list of arbitrarily picked dates:
dates = ['01/01/2017','01/30/2017','2/28/2017'] etc.
I also have a pandas dataframe with a 'transaction_dt' field:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
What I would like to do is to create a function/apply or lambda to compare the transaction_dt field values to the date list values and then create two new fields called 'begin_dt' and 'end_dt' for the intervals that the 'transaction_dt' falls between.
EDIT:
I actually think it might make things much simpler and flexible to have two separate lists for this application:
start_dates = ['2004-01-01','2004-01-31','2004-03-01','2004-03-31']
end_dates = ['2004-01-30','2004-02-29','2004-03-30','2004-4-29']
Upvotes: 0
Views: 77
Reputation: 40878
Using pandas and itertools (will yield NaT when no true interval is found):
from itertools import islice
import numpy as np
import pandas as pd
# df = pd.read_clipboard(parse_dates=['transaction_dt'],
# dtype={'product': 'category'})
dates = ['01/01/2004', '01/30/2004', '2/28/2004']
dates = pd.to_datetime(dates).sort_values()
def window(seq, n=2):
# From old cookbook:
# https://docs.python.org/release/2.3.5/lib/itertools-example.html
# This gets us valid (start, end) date combinations
it = iter(seq)
result = tuple(islice(it, n))
if len(result) == n:
yield result
for elem in it:
result = result[1:] + (elem,)
yield result
_between_dates = lambda tgt, date1, date2: date1 < tgt < date2
def between_dates(tgt, dates):
res = [np.nan] * 2 # default
for d1, d2 in window(dates):
if _between_dates(tgt, d1, d2):
res = [d1, d2]
return res
between = pd.DataFrame(df.transaction_dt.apply(
lambda x: between_dates(x, dates)).values.tolist(),
columns=['begin_dt', 'end_dt'])
df = pd.concat((df, between), axis=1)
Result:
customer_id transaction_dt product price units begin_dt end_dt
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-30
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-30
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-30
3 3 2017-07-15 thing3 55 17 NaT NaT
4 3 2016-05-12 thing3 55 47 NaT NaT
5 4 2012-02-23 thing2 150 22 NaT NaT
6 4 2009-10-10 thing1 25 12 NaT NaT
7 4 2014-04-04 thing2 150 2 NaT NaT
8 5 2008-07-09 thing2 150 43 NaT NaT
You'd want to make these changes:
# Don't need to define `window` or import islice
# Consider using `pd.date_range()` here
# Also may want to confirm that no dates overlap
# and estbalish some logic on what constitutes "falls in."
# ("On or after" versus "after")
start_dates = pd.to_datetime(start_dates)
end_dates = pd.to_datetime(end_dates)
def _between_dates(tgt, date1, date2):
# Better form to define with `def` than `lambda` here
return date1 < tgt < date2
def between_dates(tgt, start_dates, end_dates):
res = [np.nan] * 2 # default
for d1, d2 in zip(start_dates, end_dates):
if _between_dates(tgt, d1, d2):
res = [d1, d2]
return res
between = pd.DataFrame(df.transaction_dt.apply(
lambda x: between_dates(x, start_dates, end_dates)).values.tolist(),
columns=['begin_dt', 'end_dt'])
Upvotes: 1