user
user

Reputation: 771

How to determine the difference in days in a list of dates?

I have a dataframe containing a column which contains a list of dates. The length of dates can range (2+ dates). I was hoping to create a new column that contains the number of days between the minimum and maximum of dates in this list and am not entirely sure the best way to do this? Any help would be greatly appreciated!

data = [
    ["Item_1", ["2020-06-01", "2020-06-02", "2020-07-05"]],
    ["Item_2", ["2018-04-15", "2018-04-22"]],
    ["Item_3", ["2016-02-15", "2016-02-22", "2016-03-05", "2016-04-01"]],
]

df = pd.DataFrame(data, columns=["Item_ID", "Dates"])
df

Upvotes: 2

Views: 393

Answers (2)

wwnde
wwnde

Reputation: 26686

There are many ways

Option 1: Keep it numpy and one liner

df['Lapse'] =df.agg(lambda x: np.ptp(np.array(x['Dates'], dtype='datetime64')), axis=1)

Option 2: Go the long way

  1. Explode

  2. Coerce date to date time

  3. Find differences of the extremes using np.ptpt

    df=df.explode('Dates')
         df['Dates']=pd.to_datetime(df['Dates'], format='%d,%m,%Y')
         df.groupby('Item_ID').agg(lapse= ('Dates', np.ptp), Dates=('Dates', list))
    

Upvotes: 1

Henry Ecker
Henry Ecker

Reputation: 35686

We can Series.explode the Dates column, convert to_datetime, then groupby agg to find the min and max dates per group, take the diff of each group, and assign the result back to a new column:

df['Duration'] = (
    # explode lists into usable Series and convert to Datetime
    pd.to_datetime(df['Dates'].explode())
        .groupby(level=0).agg(['min', 'max'])  # Get min and max per group
        .diff(axis=1)  # Diff across rows
        .iloc[:, -1]  # Get the resulting difference
)

If the lists are guaranteed to be sorted, we can simply subtract the last value in the list from the first to get the duration after converting to_datetime:

df['Duration'] = (
    # get last value in list and subtract from first value 
    # after converting each to datetime
    pd.to_datetime(df['Dates'].str[-1]) - pd.to_datetime(df['Dates'].str[0])
)

Both options produce df:

  Item_ID                                             Dates Duration
0  Item_1              [2020-06-01, 2020-06-02, 2020-07-05]  34 days
1  Item_2                          [2018-04-15, 2018-04-22]   7 days
2  Item_3  [2016-02-15, 2016-02-22, 2016-03-05, 2016-04-01]  46 days

Upvotes: 3

Related Questions