Reputation: 33
I need to solve a data science problem using only Python and Pandas where the given input is the date of check in, the date of check out, and the payments of every books from a single person
check_in check_out payment
2020-02-28 2020-03-02 66
2020-02-27 2020-02-29 100
I need to show how much daily earnings I got from that 2 books earlier. I got the idea that I have to split it for each date like 66/3 and distribute it from 02-28 to 03-02 and 100/2 then distribute it from 02-27 to 02-29 for the second order, then I will get result like this
date earnings
2020-02-27 50
2020-02-28 22+50
2020-02-29 22
2020-03-01 22
Which will be like this if I summed them up
date earnings
2020-02-27 50
2020-02-28 72
2020-02-29 22
2020-03-01 22
I have already divided the payment column from input and create a new dataframe with two new columns which represent how long did the guests stay using day_stay and how much earnings did I receive with daily_earn using the code below
df["date_stay"] = abs(df["check_in"] - df["check_out"]) # Get difference
df["date_stay"] = pd.to_numeric(df["date_stay"].dt.days) # Turn to days
df["daily_earn"] = df["payment"]/df["date_stay"]
Then the result will be look like this
check_in check_out payment day_stay daily_earn
2020-02-28 2020-03-02 66 3 22
2020-02-27 2020-02-29 100 2 50
The last step is to distribute the value in daily_earn to the corresponding date from 02-27 to 03-01 but I don't have any idea to do that. I have tried to use df.groupby("check_in").sum() but it didn't give the desired output since the number of rows between the input and output is different. Any ideas?
Upvotes: 2
Views: 433
Reputation: 35626
Let's try a different approach:
df['date'] = df.apply(
lambda r: pd.date_range(r['check_in'], r['check_out'], closed='left'),
axis=1)
df = df.explode('date')
df['earnings'] = df['payment'] / df.groupby(level=0)['date'].transform('count')
df = df.groupby('date', as_index=False)['earnings'].agg('sum')
df
:
date earnings
0 2020-02-27 50.0
1 2020-02-28 72.0
2 2020-02-29 22.0
3 2020-03-01 22.0
Breakdown of steps:
apply
pd.date_range
to each row to get the days between start and end:
df['date'] = df.apply(
lambda r: pd.date_range(r['check_in'], r['check_out'], closed='left'),
axis=1)
check_in check_out payment date
0 2020-02-28 2020-03-02 66 DatetimeIndex(['2020-02-28', '2020-02-29', '2020-03-01'], dtype='datetime64[ns]', freq='D')
1 2020-02-27 2020-02-29 100 DatetimeIndex(['2020-02-27', '2020-02-28'], dtype='datetime64[ns]', freq='D')
Then explode
the date
into rows:
df = df.explode('date')
check_in check_out payment date
0 2020-02-28 2020-03-02 66 2020-02-28
0 2020-02-28 2020-03-02 66 2020-02-29
0 2020-02-28 2020-03-02 66 2020-03-01
1 2020-02-27 2020-02-29 100 2020-02-27
1 2020-02-27 2020-02-29 100 2020-02-28
Then groupby transform
count date
to get the number of dates and divide the payment by the number of days to get the daily earnings:
df['earnings'] = df['payment'] / df.groupby(level=0)['date'].transform('count')
check_in check_out payment date earnings
0 2020-02-28 2020-03-02 66 2020-02-28 22.0
0 2020-02-28 2020-03-02 66 2020-02-29 22.0
0 2020-02-28 2020-03-02 66 2020-03-01 22.0
1 2020-02-27 2020-02-29 100 2020-02-27 50.0
1 2020-02-27 2020-02-29 100 2020-02-28 50.0
Then groupby agg
sum the earnings on the date
to get the total per date
:
df = df.groupby('date', as_index=False)['earnings'].agg('sum')
date earnings
0 2020-02-27 50.0
1 2020-02-28 72.0
2 2020-02-29 22.0
3 2020-03-01 22.0
DataFrame and imports used:
import pandas as pd
df = pd.DataFrame({'check_in': {0: '2020-02-28', 1: '2020-02-27'},
'check_out': {0: '2020-03-02', 1: '2020-02-29'},
'payment': {0: 66, 1: 100}})
df['check_in'] = pd.to_datetime(df['check_in'])
df['check_out'] = pd.to_datetime(df['check_out'])
Upvotes: 3