Pylander
Pylander

Reputation: 1591

Create Date Interval Fields from List of Dates

I have a list of arbitrarily picked dates:

dates = ['01/01/2017','01/30/2017','2/28/2017'] etc.

I also have a pandas dataframe with a 'transaction_dt' field:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43

What I would like to do is to create a function/apply or lambda to compare the transaction_dt field values to the date list values and then create two new fields called 'begin_dt' and 'end_dt' for the intervals that the 'transaction_dt' falls between.

EDIT:

I actually think it might make things much simpler and flexible to have two separate lists for this application:

start_dates = ['2004-01-01','2004-01-31','2004-03-01','2004-03-31']
end_dates = ['2004-01-30','2004-02-29','2004-03-30','2004-4-29']

Upvotes: 0

Views: 77

Answers (1)

Brad Solomon
Brad Solomon

Reputation: 40878

Using pandas and itertools (will yield NaT when no true interval is found):

from itertools import islice
import numpy as np
import pandas as pd

# df = pd.read_clipboard(parse_dates=['transaction_dt'], 
#                        dtype={'product': 'category'})
dates = ['01/01/2004', '01/30/2004', '2/28/2004']
dates = pd.to_datetime(dates).sort_values()


def window(seq, n=2):
    # From old cookbook:
    # https://docs.python.org/release/2.3.5/lib/itertools-example.html
    # This gets us valid (start, end) date combinations
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result


_between_dates = lambda tgt, date1, date2: date1 < tgt < date2


def between_dates(tgt, dates):
    res = [np.nan] * 2  # default
    for d1, d2 in window(dates):
        if _between_dates(tgt, d1, d2):
            res = [d1, d2]
    return res


between = pd.DataFrame(df.transaction_dt.apply(
    lambda x: between_dates(x, dates)).values.tolist(),
    columns=['begin_dt', 'end_dt'])

df = pd.concat((df, between), axis=1)

Result:

   customer_id transaction_dt product  price  units   begin_dt     end_dt
0            1     2004-01-02  thing1     25     47 2004-01-01 2004-01-30
1            1     2004-01-17  thing2    150      8 2004-01-01 2004-01-30
2            2     2004-01-29  thing2    150     25 2004-01-01 2004-01-30
3            3     2017-07-15  thing3     55     17        NaT        NaT
4            3     2016-05-12  thing3     55     47        NaT        NaT
5            4     2012-02-23  thing2    150     22        NaT        NaT
6            4     2009-10-10  thing1     25     12        NaT        NaT
7            4     2014-04-04  thing2    150      2        NaT        NaT
8            5     2008-07-09  thing2    150     43        NaT        NaT

Addressing your edit

You'd want to make these changes:

# Don't need to define `window` or import islice

# Consider using `pd.date_range()` here
# Also may want to confirm that no dates overlap
#     and estbalish some logic on what constitutes "falls in."
#     ("On or after" versus "after")
start_dates = pd.to_datetime(start_dates)
end_dates  = pd.to_datetime(end_dates)


def _between_dates(tgt, date1, date2):
    # Better form to define with `def` than `lambda` here
    return date1 < tgt < date2


def between_dates(tgt, start_dates, end_dates):
    res = [np.nan] * 2  # default
    for d1, d2 in zip(start_dates, end_dates):
        if _between_dates(tgt, d1, d2):
            res = [d1, d2]
    return res

between = pd.DataFrame(df.transaction_dt.apply(
    lambda x: between_dates(x, start_dates, end_dates)).values.tolist(),
    columns=['begin_dt', 'end_dt'])

Upvotes: 1

Related Questions