noliverte
noliverte

Reputation: 301

difference between multiple dates based on id column in a Pandas dataframe

My practical case is the following, in a store I would like to know visit period (in day) between the 1st visit and the 2nd visit, the 2nd and the 3rd,...

I have a python dataset with 2 columns (visit IDs for each customer and a date of visit)

data = {'Id': ['A', 'B','A','B','A','A'],
'Date': ['01/03/2022', '03/03/2022', '05/03/2022', '07/03/2022', '09/03/2022','11/03/2022']
}

enter image description here

My question : How many days are there between the 1st visit and the 2nd visit for customers who have come 4 times? same question between the 2nd and the 3rd visit...

Upvotes: 1

Views: 1339

Answers (2)

rcshon
rcshon

Reputation: 927

You can first cast them to pd.Timestamp objects and sort the ids and dates by

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Id', 'Date'])

Then, you can simply calculate the difference in the timestamps, multiplied by the boolean mask of whether it is the same user:

df['Date Since Last Visit'] = (df['Date'] - df['Date'].shift(1)) * (df['Id'] == df['Id'].shift())
df['Date Since Last Visit'] = df['Date Since Last Visit'].fillna(pd.Timedelta(0))

Which will give you the timedelta objects that you can see how many days it has been.

Your output will be:


    Id  Date        Date Since Last Visit
0   A   2022-03-01  0 days
2   A   2022-03-05  4 days
4   A   2022-03-09  4 days
5   A   2022-03-11  2 days
1   B   2022-03-03  0 days
3   B   2022-03-07  4 days

Or just filter out the first visits after:

df['Date Since Last Visit'] = df['Date'] - df['Date'].shift(1)
df = df[df['Id'] == df['Id'].shift()]

Which gives you:

    Id  Date        Date Since Last Visit
2   A   2022-03-05  4 days
4   A   2022-03-09  4 days
5   A   2022-03-11  2 days
3   B   2022-03-07  4 days

Upvotes: 1

mozway
mozway

Reputation: 260410

The output you expect is unclear, but let's compute a 2D table with the customers as index and the visits as columns:

# convert to datetime (assuming DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# setting up group
g = df.groupby('Id')['Date']

df2 = (df
 .sort_values(by='Id')      # ensure the dates are ordered
 .assign(visit=g.cumcount().add(1),      # visit number
         nb_visits=g.transform('count'), # total number of visits
         diff=g.diff()                   # difference between successive visits
         )
 .query('nb_visits >= 4')   # filter to keep customers with at least 4 visits
 .pivot(index='Id', columns='visit', values='diff')  # reshape to 2D
)

output:

This gives the number of days since the previous visit for each customer (if more that 3 visits)

visit   1      2      3      4
Id                            
A     NaT 4 days 4 days 2 days

NB. you could remove the first column here as it will always be undefined (NaT)

Upvotes: 1

Related Questions