I am given a dataframe with cumulative count data. An example is generated as follows (feel free to skip):
import numpy as np
import pandas as pd
cols = ['Start', 'End', 'Count']
data = np.array([
'2020-1-1', '2020-1-2', 4,
'2020-1-1', '2020-1-3', 6,
'2020-1-1', '2020-1-4', 8,
'2020-2-1', '2020-2-2', 3,
'2020-2-1', '2020-2-3', 4,
'2020-2-1', '2020-2-4', 4])
data = data.reshape((6,3))
df = pd.DataFrame(columns=cols, data=data)
df['Start'] = pd.to_datetime(df.Start)
df['End'] = pd.to_datetime(df.End)
df['Count'] = df['Count'].astype(int)  # np.array stored the mixed values as strings
This gives the following dataframe:
Start End Count
2020-1-1 2020-1-2 4
2020-1-1 2020-1-3 6
2020-1-1 2020-1-4 8
2020-2-1 2020-2-2 3
2020-2-1 2020-2-3 4
2020-2-1 2020-2-4 4
The counts are cumulative (accumulation starts on Start) and I want to undo the accumulation to get (note the change in dates):
Start End Count
2020-1-1 2020-1-2 4
2020-1-2 2020-1-3 2
2020-1-3 2020-1-4 2
2020-2-1 2020-2-2 3
2020-2-2 2020-2-3 1
2020-2-3 2020-2-4 0
I would like to do this for grouped variables. This can be done naively by:
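lst = []
# 'grouping_variable' stands in for whatever extra column(s) define the groups
for start, data in df.groupby(['Start', 'grouping_variable']):
    data = data.sort_values('End')
    # Difference between consecutive cumulative counts; the first row keeps its own count
    diff = data.Count.diff()
    diff.iloc[0] = data.Count.iloc[0]
    # Each row now starts where the previous row ended
    start_dates = [data.Start.iloc[0]] + list(data.End[:-1].values)
    data = data.assign(Start=start_dates,
                       Count=diff)
    lst.append(data)
df = pd.concat(lst)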
This does not feel "right", "pythonic" or "clean" in any way. Is there a better way? Perhaps Pandas has a specific method to do this?
Reputation: 23099
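IIUC, we can use cumcount to build a boolean mask that marks the first row of each Start group, then use np.where with shift: the first row of each group keeps its own values, and every other row takes the difference from the previous row's Count and the previous row's End as its new Start. This relies on the rows already being sorted by Start and End.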
import numpy as np
# Count must be numeric for the subtraction below
df['Count'] = df['Count'].astype(int)
# True on the first row of each Start group
s = df.groupby(['Start']).cumcount() == 0
df['Count'] = np.where(s, df['Count'], df['Count'] - df['Count'].shift())
df['Start'] = np.where(s, df['Start'], df['End'].shift(1))
print(df)
Start End Count
0 2020-01-01 2020-01-02 4.0
1 2020-01-02 2020-01-03 2.0
2 2020-01-03 2020-01-04 2.0
3 2020-02-01 2020-02-02 3.0
4 2020-02-02 2020-02-03 1.0
5 2020-02-03 2020-02-04 0.0
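If you need extra grouping columns, a variant that shifts and diffs within each group (rather than across the whole frame) may be easier to extend. This is only a sketch: the keys list is my own name, and I am assuming you would append your grouping column(s) to it. It rebuilds the example frame with integer counts so it runs on its own.

```python
import pandas as pd

# Example frame from the question, with Count stored as integers
df = pd.DataFrame({
    'Start': pd.to_datetime(['2020-1-1'] * 3 + ['2020-2-1'] * 3),
    'End': pd.to_datetime(['2020-1-2', '2020-1-3', '2020-1-4',
                           '2020-2-2', '2020-2-3', '2020-2-4']),
    'Count': [4, 6, 8, 3, 4, 4],
})

# Assumption: add any further grouping columns to this list
keys = ['Start']

df = df.sort_values(keys + ['End'])
g = df.groupby(keys)

# Within each group: difference to the previous row; the first row keeps its count
count = g['Count'].diff().fillna(df['Count']).astype(int)
# Within each group: previous row's End; the first row keeps its original Start
start = g['End'].shift().fillna(df['Start'])

out = df.assign(Start=start, Count=count)
print(out)
```

Because diff and shift are applied per group, no value leaks from one group into the next, so no mask for the first row is needed; fillna restores the first row of each group from the original columns.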