IceAsher Chew
IceAsher Chew

Reputation: 155

Pandas groupby time and ID and aggregate

I am trying to calculate, what is the sum of payment made 2nd half of year minus the 1st half of the year.

This is how the data may look:

   ID   date    payment
    1   1/1/2020    10
    1   1/2/2020    11
    1   1/3/2020    10
    1   1/4/2020    10
    1   1/5/2020    11
    1   1/6/2020    10
    1   1/7/2020    10
    1   1/8/2020    11
    1   1/9/2020    10
    1   1/10/2020   32
    1   1/11/2020   10
    1   1/12/2020   12
    2   1/1/2020    10
    2   1/2/2020    10
    2   1/3/2020    41
    2   1/4/2020    10
    2   1/5/2020    53
    2   1/6/2020    10
    2   1/7/2020    10
    2   1/8/2020    44
    2   1/9/2020    10
    2   1/10/2020   2
    2   1/11/2020   9
    2   1/12/2020   5

I convert the df date to a pandas dt

df.date = df.date.astype(str).str.slice(0, 10)
df.date = pd.to_datetime(pay.date)

print(df.date.min(),df.date.max()) 
output: 2020-01-01 00:00:00 2020-12-01 00:00:00

Then i create time points and different data frames for 1st and 2nd half of the year

observation_date = '2020-12-31'
observation_date = datetime.strptime(observation_date, '%Y-%m-%d')
observation_date = observation_date.date()
observation_date = pd.Timestamp(observation_date)
print(observation_date)         

mo6_ago = observation_date - relativedelta(months=6) 
mo6_ago = pd.Timestamp(mo6_ago)
print(mo6_ago)

mo6_ago_plus1 = observation_date - relativedelta(months=6) + relativedelta(days=1)
mo6_ago_plus1 = pd.Timestamp(mo6_ago_plus1)
print(mo6_ago_plus1)

mo12_ago = observation_date - relativedelta(months=12) + relativedelta(days=1)
mo12_ago = pd.Timestamp(mo12_ago)
print(mo12_ago)

output:
2020-12-31 00:00:00
2020-06-30 00:00:00
2020-07-01 00:00:00
2020-01-01 00:00:00

mask = (df['date'] >= mo12_ago) & (df['date'] <= mo6_ago)
first_half = df.loc[mask]
first_half = first_half[['ID','date','payment']]
print(first_half.date.min(),first_half.date.max())

output: 2020-01-01 00:00:00 2020-06-01 00:00:00

mask = (df['date'] >= mo6_ago_plus1) & (df['date'] <= observation_date)
sec_half = df.loc[mask]
sec_half = sec_half[['ID','date','payment']]
print(sec_half.date.min(),sec_half.date.max())

output: 2020-07-01 00:00:00 2020-12-01 00:00:00

then i group and sum for the 2 half of the year and merge them into one df like that

sum_first_half = first_half.groupby(['ID'])['payment'].sum().reset_index()
sum_first_half = sum_first_half.rename(columns = {'payment':'payment_first_half'})

sum_sec_half = sec_half.groupby(['ID'])['payment'].sum().reset_index()
sum_sec_half = sum_sec_half.rename(columns = {'payment':'payment_sec_half'})

df_new = pd.merge(sum_first_half, sum_sec_half, how='outer', on='ID')

Finally i take minus the 2 columns this way

df_new['sec_minus_first'] = df_new['payment_sec_half'] -df_new['payment_first_half']

ID  payment_first_half  payment_sec_half    sec_minus_first
1         62                    85               23
2        134                    80              -54

Is there a faster and more memory efficient way of doing this?

Upvotes: 3

Views: 256

Answers (1)

noah
noah

Reputation: 2786

Using datetime:

from datetime import datetime as dt

Convert date column to datetime:

df["date"] = pd.to_datetime(df["date"])

Split on a date of your choice, group by ID, sum each half, then subtract the halves:

df.loc[df['date'] >= dt(2020, 7, 1)].groupby("ID").sum() - df.loc[df['date'] < dt(2020, 7, 1)].groupby("ID").sum()

Upvotes: 2

Related Questions