Reputation: 27
I am looking for help here on how to do this in Python / Panda:
I am looking to take the original data (below) and find the daily difference of multiple cols (cnt_a and cnt_b) by a group with multiple cols (state, county and date).
I've been trying it different ways, and I can't seem to get by the "check for duplicate" issue
df.cnt_a = df.sort_values(['state','county','date']).groupby['state','county','date','cnt_a'].diff(-1)
Tried splitting it out to fix one thing at a time:
df1 = df.sort_values(['state','county','date'])
df2 = df1.groupby(['state','county'])['cnt_a'].diff()
Original Data. => df
date county state cnt_a cnt_b
2020-06-13 Bergen New Jersey 308 11
2020-06-14 Bergen New Jersey 308 11
2020-06-15 Bergen New Jersey 320 15
2020-06-12 Union New Jersey 100 3
2020-06-13 Union New Jersey 130 4
2020-06-14 Union New Jersey 150 5
2020-06-12 Bronx New York 200 100
2020-06-13 Bronx New York 210 200
Wanted Output
date county state cnt_a cnt_b daydiff_a daydiff_b
2020-06-13 Bergen New Jersey 308 11 0 0
2020-06-14 Bergen New Jersey 308 11 0 0
2020-06-15 Bergen New Jersey 320 15 12 4
2020-06-12 Union New Jersey 100 3 0 0
2020-06-13 Union New Jersey 130 4 30 1
2020-06-14 Union New Jersey 150 5 20 1
2020-06-12 Bronx New York 200 100 0 0
2020-06-13 Bronx New York 210 200 10 100
Upvotes: 1
Views: 2669
Reputation: 62493
df
, because df.groupby
will be sorted. If df
isn't sorted first, the joined columns from .groupby
will not be in the same order as df
.
df
, in order, by 'state'
, 'country'
, and 'date'
, however, the 'date'
column is ignored in .groupby
..groupby
on 'state'
and 'country'
, and aggregate .diff
to the desired columns.fillna
with 0, and .join
df
to the groupby object, based on the index.
rsuffix
, and or use .rename
to change the column headers.import pandas as pd
# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}
df = pd.DataFrame(data)
# set the date column to a datetime format
df.date = pd.to_datetime(df.date)
# sort the values
df = df.sort_values(['state', 'county', 'date'])
# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')
# display(df)
date county state cnt_a cnt_b cnt_a_diff cnt_b_diff
0 2020-06-13 Bergen New Jersey 308 11 0.0 0.0
1 2020-06-14 Bergen New Jersey 308 11 0.0 0.0
2 2020-06-15 Bergen New Jersey 320 15 12.0 4.0
3 2020-06-12 Union New Jersey 100 3 0.0 0.0
4 2020-06-13 Union New Jersey 130 4 30.0 1.0
5 2020-06-14 Union New Jersey 150 5 20.0 1.0
6 2020-06-12 Bronx New York 200 100 0.0 0.0
7 2020-06-13 Bronx New York 210 200 10.0 100.0
Upvotes: 1