How to groupby multiple columns and aggregate diff on different columns?

Question

I am looking for help here on how to do this in Python / Panda:

I am looking to take the original data (below) and find the daily difference of multiple cols (cnt_a and cnt_b) by a group with multiple cols (state, county and date).

I've been trying it different ways, and I can't seem to get by the "check for duplicate" issue

df.cnt_a = df.sort_values(['state','county','date']).groupby['state','county','date','cnt_a'].diff(-1)

Tried splitting it out to fix one thing at a time:

df1 = df.sort_values(['state','county','date'])

df2 = df1.groupby(['state','county'])['cnt_a'].diff()

Original Data. => df

date        county  state       cnt_a    cnt_b
2020-06-13  Bergen  New Jersey   308     11
2020-06-14  Bergen  New Jersey   308     11
2020-06-15  Bergen  New Jersey   320     15
2020-06-12  Union   New Jersey   100     3
2020-06-13  Union   New Jersey   130     4
2020-06-14  Union   New Jersey   150     5
2020-06-12  Bronx   New York     200     100
2020-06-13  Bronx   New York     210     200

Wanted Output

date        county  state       cnt_a   cnt_b   daydiff_a    daydiff_b
2020-06-13  Bergen  New Jersey   308     11        0            0 
2020-06-14  Bergen  New Jersey   308     11        0            0
2020-06-15  Bergen  New Jersey   320     15        12           4
2020-06-12  Union   New Jersey   100     3         0            0
2020-06-13  Union   New Jersey   130     4         30           1
2020-06-14  Union   New Jersey   150     5         20           1
2020-06-12  Bronx   New York     200     100       0            0 
2020-06-13  Bronx   New York     210     200       10           100

Trenton McKinney · Accepted Answer

It's important to sort df, because df.groupby will be sorted. If df isn't sorted first, the joined columns from .groupby will not be in the same order as df.
- Be certain to df, in order, by 'state', 'country', and 'date', however, the 'date' column is ignored in .groupby.
.groupby on 'state' and 'country', and aggregate .diff to the desired columns.
fillna with 0, and .join df to the groupby object, based on the index.
- Specify rsuffix, and or use .rename to change the column headers.

import pandas as pd

# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}

df = pd.DataFrame(data)

# set the date column to a datetime format
df.date = pd.to_datetime(df.date)

# sort the values
df = df.sort_values(['state', 'county', 'date'])

# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')

# display(df)
        date  county       state  cnt_a  cnt_b  cnt_a_diff  cnt_b_diff
0 2020-06-13  Bergen  New Jersey    308     11         0.0         0.0
1 2020-06-14  Bergen  New Jersey    308     11         0.0         0.0
2 2020-06-15  Bergen  New Jersey    320     15        12.0         4.0
3 2020-06-12   Union  New Jersey    100      3         0.0         0.0
4 2020-06-13   Union  New Jersey    130      4        30.0         1.0
5 2020-06-14   Union  New Jersey    150      5        20.0         1.0
6 2020-06-12   Bronx    New York    200    100         0.0         0.0
7 2020-06-13   Bronx    New York    210    200        10.0       100.0

How to groupby multiple columns and aggregate diff on different columns?

Answers (1)

Related Questions