JamesH
JamesH

Reputation: 27

How to groupby multiple columns and aggregate diff on different columns?

I am looking for help here on how to do this in Python / Panda:

I am looking to take the original data (below) and find the daily difference of multiple cols (cnt_a and cnt_b) by a group with multiple cols (state, county and date).

I've been trying it different ways, and I can't seem to get by the "check for duplicate" issue

df.cnt_a = df.sort_values(['state','county','date']).groupby['state','county','date','cnt_a'].diff(-1)

Tried splitting it out to fix one thing at a time:

df1 = df.sort_values(['state','county','date'])

df2 = df1.groupby(['state','county'])['cnt_a'].diff()

Original Data. => df

date        county  state       cnt_a    cnt_b
2020-06-13  Bergen  New Jersey   308     11
2020-06-14  Bergen  New Jersey   308     11
2020-06-15  Bergen  New Jersey   320     15
2020-06-12  Union   New Jersey   100     3
2020-06-13  Union   New Jersey   130     4
2020-06-14  Union   New Jersey   150     5
2020-06-12  Bronx   New York     200     100
2020-06-13  Bronx   New York     210     200

Wanted Output

date        county  state       cnt_a   cnt_b   daydiff_a    daydiff_b
2020-06-13  Bergen  New Jersey   308     11        0            0 
2020-06-14  Bergen  New Jersey   308     11        0            0
2020-06-15  Bergen  New Jersey   320     15        12           4
2020-06-12  Union   New Jersey   100     3         0            0
2020-06-13  Union   New Jersey   130     4         30           1
2020-06-14  Union   New Jersey   150     5         20           1
2020-06-12  Bronx   New York     200     100       0            0 
2020-06-13  Bronx   New York     210     200       10           100

Upvotes: 1

Views: 2669

Answers (1)

Trenton McKinney
Trenton McKinney

Reputation: 62493

  • It's important to sort df, because df.groupby will be sorted. If df isn't sorted first, the joined columns from .groupby will not be in the same order as df.
    • Be certain to df, in order, by 'state', 'country', and 'date', however, the 'date' column is ignored in .groupby.
  • .groupby on 'state' and 'country', and aggregate .diff to the desired columns.
  • fillna with 0, and .join df to the groupby object, based on the index.
    • Specify rsuffix, and or use .rename to change the column headers.
import pandas as pd

# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}

df = pd.DataFrame(data)

# set the date column to a datetime format
df.date = pd.to_datetime(df.date)

# sort the values
df = df.sort_values(['state', 'county', 'date'])

# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')

# display(df)
        date  county       state  cnt_a  cnt_b  cnt_a_diff  cnt_b_diff
0 2020-06-13  Bergen  New Jersey    308     11         0.0         0.0
1 2020-06-14  Bergen  New Jersey    308     11         0.0         0.0
2 2020-06-15  Bergen  New Jersey    320     15        12.0         4.0
3 2020-06-12   Union  New Jersey    100      3         0.0         0.0
4 2020-06-13   Union  New Jersey    130      4        30.0         1.0
5 2020-06-14   Union  New Jersey    150      5        20.0         1.0
6 2020-06-12   Bronx    New York    200    100         0.0         0.0
7 2020-06-13   Bronx    New York    210    200        10.0       100.0

Upvotes: 1

Related Questions