Reputation: 155
I am trying to calculate, what is the sum of payment made 2nd half of year minus the 1st half of the year.
This is how the data may look:
ID date payment
1 1/1/2020 10
1 1/2/2020 11
1 1/3/2020 10
1 1/4/2020 10
1 1/5/2020 11
1 1/6/2020 10
1 1/7/2020 10
1 1/8/2020 11
1 1/9/2020 10
1 1/10/2020 32
1 1/11/2020 10
1 1/12/2020 12
2 1/1/2020 10
2 1/2/2020 10
2 1/3/2020 41
2 1/4/2020 10
2 1/5/2020 53
2 1/6/2020 10
2 1/7/2020 10
2 1/8/2020 44
2 1/9/2020 10
2 1/10/2020 2
2 1/11/2020 9
2 1/12/2020 5
I convert the df date to a pandas dt
df.date = df.date.astype(str).str.slice(0, 10)
df.date = pd.to_datetime(pay.date)
print(df.date.min(),df.date.max())
output: 2020-01-01 00:00:00 2020-12-01 00:00:00
Then i create time points and different data frames for 1st and 2nd half of the year
observation_date = '2020-12-31'
observation_date = datetime.strptime(observation_date, '%Y-%m-%d')
observation_date = observation_date.date()
observation_date = pd.Timestamp(observation_date)
print(observation_date)
mo6_ago = observation_date - relativedelta(months=6)
mo6_ago = pd.Timestamp(mo6_ago)
print(mo6_ago)
mo6_ago_plus1 = observation_date - relativedelta(months=6) + relativedelta(days=1)
mo6_ago_plus1 = pd.Timestamp(mo6_ago_plus1)
print(mo6_ago_plus1)
mo12_ago = observation_date - relativedelta(months=12) + relativedelta(days=1)
mo12_ago = pd.Timestamp(mo12_ago)
print(mo12_ago)
output:
2020-12-31 00:00:00
2020-06-30 00:00:00
2020-07-01 00:00:00
2020-01-01 00:00:00
mask = (df['date'] >= mo12_ago) & (df['date'] <= mo6_ago)
first_half = df.loc[mask]
first_half = first_half[['ID','date','payment']]
print(first_half.date.min(),first_half.date.max())
output: 2020-01-01 00:00:00 2020-06-01 00:00:00
mask = (df['date'] >= mo6_ago_plus1) & (df['date'] <= observation_date)
sec_half = df.loc[mask]
sec_half = sec_half[['ID','date','payment']]
print(sec_half.date.min(),sec_half.date.max())
output: 2020-07-01 00:00:00 2020-12-01 00:00:00
then i group and sum for the 2 half of the year and merge them into one df like that
sum_first_half = first_half.groupby(['ID'])['payment'].sum().reset_index()
sum_first_half = sum_first_half.rename(columns = {'payment':'payment_first_half'})
sum_sec_half = sec_half.groupby(['ID'])['payment'].sum().reset_index()
sum_sec_half = sum_sec_half.rename(columns = {'payment':'payment_sec_half'})
df_new = pd.merge(sum_first_half, sum_sec_half, how='outer', on='ID')
Finally i take minus the 2 columns this way
df_new['sec_minus_first'] = df_new['payment_sec_half'] -df_new['payment_first_half']
ID payment_first_half payment_sec_half sec_minus_first
1 62 85 23
2 134 80 -54
Is there a faster and more memory efficient way of doing this?
Upvotes: 3
Views: 256
Reputation: 2786
Using datetime
:
from datetime import datetime as dt
Convert date column to datetime:
df["date"] = pd.to_datetime(df["date"])
Split on a date of your choice, group by ID, sum each half, then subtract the halves:
df.loc[df['date'] >= dt(2020, 7, 1)].groupby("ID").sum() - df.loc[df['date'] < dt(2020, 7, 1)].groupby("ID").sum()
Upvotes: 2