sayan sen

Reputation: 31

calculate difference between consecutive date records at an ID level

I have a dataframe as

   col 1     col 2
    A         2020-07-13 
    A         2020-07-15 
    A         2020-07-18 
    A         2020-07-19
    B         2020-07-13
    B         2020-07-19 
    C         2020-07-13
    C         2020-07-18 
   

I want it to become the following in a new dataframe

    col_3     diff_btw_1st_2nd_date   diff_btw_2nd_3rd_date  diff_btw_3rd_4th_date
     A              2                       3                       1
     B              6                       NaN                     NaN
     C              5                       NaN                     NaN

I tried a groupby at the col 1 level, but I'm not getting the intended result. Can anyone help?
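For reference, a minimal sketch to reproduce the input frame (column names and dates taken from the table above):

import pandas as pd

df = pd.DataFrame({
    "col 1": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "col 2": ["2020-07-13", "2020-07-15", "2020-07-18", "2020-07-19",
              "2020-07-13", "2020-07-19", "2020-07-13", "2020-07-18"],
})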

Upvotes: 0

Views: 26

Answers (2)

Bill Huang

Reputation: 4648

You can assign a cumulative count (cumcount) within each col 1 group and then pivot the table on that count.

Solution

df["col 2"] = pd.to_datetime(df["col 2"])

# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference, which is NaN
df2 = df[["col 1", "diff", "cumcount"]]\
    .pivot(index="col 1", columns="cumcount")\
    .drop(columns=[("diff", 0)])

Result

# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)

       d2-d1  d3-d2  d4-d3
col 1                     
A        2.0    3.0    1.0
B        6.0    NaN    NaN
C        5.0    NaN    NaN
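If col 1 should appear as a regular column named col_3 (matching the expected output in the question), one possible follow-up, assuming df2 as built above:

df2 = df2.rename_axis("col_3").reset_index()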

df after assigning the cumcount looks like this

print(df)
  col 1      col 2  diff  cumcount
0     A 2020-07-13   NaN         0
1     A 2020-07-15   2.0         1
2     A 2020-07-18   3.0         2
3     A 2020-07-19   1.0         3
4     B 2020-07-13   NaN         0
5     B 2020-07-19   6.0         1
6     C 2020-07-13   NaN         0
7     C 2020-07-18   5.0         1

Upvotes: 0

jezrael

Reputation: 862671

Use GroupBy.cumcount for a counter per group of col 1 and reshape by DataFrame.set_index with Series.unstack. Then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days in all columns with Series.dt.days, and change the column names with DataFrame.add_prefix:

import pandas as pd

df['col 2'] = pd.to_datetime(df['col 2'])

df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
  col 1  diff_1  diff_2  diff_3
0     A       2     3.0     1.0
1     B       6     NaN     NaN
2     C       5     NaN     NaN

Or use DataFrameGroupBy.diff with a counter, create the new columns by DataFrame.assign, reshape by DataFrame.pivot, and remove the NaNs in c1 with DataFrame.dropna:

df['col 2'] = pd.to_datetime(df['col 2'])

df = (df.assign(g = df.groupby('col 1').cumcount(),
               c1 = df.groupby('col 1')['col 2'].diff().dt.days)
       .dropna(subset=['c1'])
       .pivot(index='col 1', columns='g', values='c1')
       .add_prefix('diff_')
       .rename_axis(None, axis=1)
       .reset_index())

print (df)
  col 1  diff_1  diff_2  diff_3
0     A     2.0     3.0     1.0
1     B     6.0     NaN     NaN
2     C     5.0     NaN     NaN
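If the exact headers from the question are wanted, the generic diff_* columns can be renamed afterwards; a small sketch assuming three difference columns, as in the sample data:

df = df.rename(columns={'col 1': 'col_3',
                        'diff_1': 'diff_btw_1st_2nd_date',
                        'diff_2': 'diff_btw_2nd_3rd_date',
                        'diff_3': 'diff_btw_3rd_4th_date'})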

Upvotes: 1
