Reputation: 301
My practical case is the following, in a store I would like to know visit period (in day) between the 1st visit and the 2nd visit, the 2nd and the 3rd,...
I have a python dataset with 2 columns (visit IDs for each customer and a date of visit)
data = {'Id': ['A', 'B','A','B','A','A'],
'Date': ['01/03/2022', '03/03/2022', '05/03/2022', '07/03/2022', '09/03/2022','11/03/2022']
}
My question : How many days are there between the 1st visit and the 2nd visit for customers who have come 4 times? same question between the 2nd and the 3rd visit...
Upvotes: 1
Views: 1339
Reputation: 927
You can first cast them to pd.Timestamp
objects and sort the ids and dates by
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Id', 'Date'])
Then, you can simply calculate the difference in the timestamps, multiplied by the boolean mask of whether it is the same user:
df['Date Since Last Visit'] = (df['Date'] - df['Date'].shift(1)) * (df['Id'] == df['Id'].shift())
df['Date Since Last Visit'] = df['Date Since Last Visit'].fillna(pd.Timedelta(0))
Which will give you the timedelta objects that you can see how many days it has been.
Your output will be:
Id Date Date Since Last Visit
0 A 2022-03-01 0 days
2 A 2022-03-05 4 days
4 A 2022-03-09 4 days
5 A 2022-03-11 2 days
1 B 2022-03-03 0 days
3 B 2022-03-07 4 days
Or just filter out the first visits after:
df['Date Since Last Visit'] = df['Date'] - df['Date'].shift(1)
df = df[df['Id'] == df['Id'].shift()]
Which gives you:
Id Date Date Since Last Visit
2 A 2022-03-05 4 days
4 A 2022-03-09 4 days
5 A 2022-03-11 2 days
3 B 2022-03-07 4 days
Upvotes: 1
Reputation: 260410
The output you expect is unclear, but let's compute a 2D table with the customers as index and the visits as columns:
# convert to datetime (assuming DD/MM/YYYY)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# setting up group
g = df.groupby('Id')['Date']
df2 = (df
.sort_values(by='Id') # ensure the dates are ordered
.assign(visit=g.cumcount().add(1), # visit number
nb_visits=g.transform('count'), # total number of visits
diff=g.diff() # difference between successive visits
)
.query('nb_visits >= 4') # filter to keep customers with at least 4 visits
.pivot(index='Id', columns='visit', values='diff') # reshape to 2D
)
output:
This gives the number of days since the previous visit for each customer (if more that 3 visits)
visit 1 2 3 4
Id
A NaT 4 days 4 days 2 days
NB. you could remove the first column here as it will always be undefined (NaT)
Upvotes: 1