Reputation: 159
I have a dataframe like this:
Name COST Timestamp
0 c 8 2023-09-20 15:14:46
1 a 8 2023-09-20 15:14:48
2 c 9 2023-09-20 15:14:55
3 b 10 2023-09-20 15:15:00
4 c 4 2023-09-20 15:15:02
5 a 9 2023-09-20 15:15:04
6 b 3 2023-09-20 15:15:12
7 a 3 2023-09-20 15:15:17
8 c 6 2023-09-20 15:15:20
9 c 6 2023-09-20 15:15:29
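To make this reproducible, here is a minimal sketch that builds the same dataframe. The variable name `df` is only an assumption, and the timestamps are parsed with `pd.to_datetime` so they are real datetimes rather than strings:

```python
import pandas as pd

# Rebuild the sample data shown above; `df` is an assumed variable name.
df = pd.DataFrame({
    "Name": list("cacbcabacc"),
    "COST": [8, 8, 9, 10, 4, 9, 3, 3, 6, 6],
    "Timestamp": pd.to_datetime([
        "2023-09-20 15:14:46", "2023-09-20 15:14:48",
        "2023-09-20 15:14:55", "2023-09-20 15:15:00",
        "2023-09-20 15:15:02", "2023-09-20 15:15:04",
        "2023-09-20 15:15:12", "2023-09-20 15:15:17",
        "2023-09-20 15:15:20", "2023-09-20 15:15:29",
    ]),
})
```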
I want to create a new dataframe from it. It should look for sequences that contain a, b and c (the order doesn't matter). sum_cost is the sum of the costs of a, b and c, and the timestamp is the last timestamp among them. If the same name appears more than once, for example b,c,c,b,b,a, take the last c, then the last b, and the a.
A detailed example:
From the given dataframe, the output is constructed like this. The first group takes 'a' at index 1, 'c' at index 2 (as this came after the 'c' at index 0) and 'b' at index 3. The sum_cost for this group is 27 and the timestamp is 2023-09-20 15:15:00, which belongs to 'b' at index 3, as it came last in this group. The next group takes 'c' at index 4, 'a' at index 5 and 'b' at index 6. Its sum_cost is 16 and its timestamp is 2023-09-20 15:15:12, which belongs to 'b' at index 6, as it came last in this sequence.
The output will look like this:
sum_cost Timestamp
0 27 2023-09-20 15:15:00
1 16 2023-09-20 15:15:12
Please help me with this. Thank you!
Upvotes: 1
Views: 90
Reputation: 261860
You need to build a custom grouper, which requires a loop. Here it is done with a custom function:
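import pandas as pd

def group_consecutive(s, target):
    # Label each run of len(target) consecutive rows whose values
    # cover `target` with 1, 2, ...; every other row gets 0.
    out = []
    i = 0
    g = 1
    while i < len(s)-len(target)+1:
        if target == set(s.iloc[i:i+len(target)]):
            out.extend([g]*len(target))
            g += 1
            i += len(target)
        else:
            out.append(0)
            i += 1
    # Pad the trailing rows that cannot start a full window. Padding up to
    # len(s) keeps the length correct even when a match ends on the last row.
    out.extend([0]*(len(s)-len(out)))
    return pd.Series(out, index=s.index)

group = group_consecutive(df['Name'], {'a', 'b', 'c'})
out = df[group>0].groupby(group).agg({'COST': 'sum', 'Timestamp': 'last'})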
Output:
COST Timestamp
1 27 2023-09-20 15:15:00
2 16 2023-09-20 15:15:12
Intermediate with group:
Name COST Timestamp group
0 c 8 2023-09-20 15:14:46 0
1 a 8 2023-09-20 15:14:48 1
2 c 9 2023-09-20 15:14:55 1
3 b 10 2023-09-20 15:15:00 1
4 c 4 2023-09-20 15:15:02 2
5 a 9 2023-09-20 15:15:04 2
6 b 3 2023-09-20 15:15:12 2
7 a 3 2023-09-20 15:15:17 0
8 c 6 2023-09-20 15:15:20 0
9 c 6 2023-09-20 15:15:29 0
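The labelling in the intermediate table can also be sketched on a plain Python list, which makes the window logic easier to follow in isolation. `label_windows` is a hypothetical helper name, not part of pandas; it preallocates the labels with zeros so no padding step is needed at the end:

```python
def label_windows(names, target):
    """Give each non-overlapping window of len(target) consecutive items
    that covers `target` a label 1, 2, ...; all other positions stay 0."""
    k = len(target)
    labels = [0] * len(names)
    i, g = 0, 1
    while i + k <= len(names):
        if set(names[i:i + k]) == target:
            labels[i:i + k] = [g] * k  # mark the whole window
            g += 1
            i += k                     # jump past the matched window
        else:
            i += 1                     # slide the window by one position
    return labels

names = list("cacbcabacc")             # the Name column of the sample data
labels = label_windows(names, {"a", "b", "c"})
print(labels)  # [0, 1, 1, 1, 2, 2, 2, 0, 0, 0]
```

The result matches the group column above, and a match that ends exactly on the last item is handled without any special case.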
Upvotes: 1
Reputation: 46
You can try this code. Change the path so that it points to your own data:
from datetime import datetime
import pandas as pd

df = pd.read_csv("df.csv")
# Sentinel date older than any row, so the first timestamp seen always wins
a_tmp = b_tmp = c_tmp = max_tmp = datetime(1998, 5, 3)
a_value = b_value = c_value = None
df2 = pd.DataFrame(columns=['sum_cost', 'timestamp'])
for row in range(len(df)):
    # Parse the timestamp string once per row
    ts = datetime.strptime(df['Timestamp'][row], "%Y-%m-%d %H:%M:%S")
    if df['Name'][row] == 'a' and ts > a_tmp:
        a_value = df['COST'][row]
        a_tmp = ts
        if a_tmp > max_tmp:
            max_tmp = a_tmp
    elif df['Name'][row] == 'b' and ts > b_tmp:
        b_value = df['COST'][row]
        b_tmp = ts
        if b_tmp > max_tmp:
            max_tmp = b_tmp
    elif df['Name'][row] == 'c' and ts > c_tmp:
        c_value = df['COST'][row]
        c_tmp = ts
        if c_tmp > max_tmp:
            max_tmp = c_tmp
    # Once a, b and c have all been seen, store the group and reset the state
    if a_value is not None and b_value is not None and c_value is not None:
        sum_of_value = a_value + b_value + c_value
        df2.loc[len(df2.index)] = [sum_of_value, max_tmp]
        a_tmp = b_tmp = c_tmp = max_tmp = datetime(1998, 5, 3)
        a_value = b_value = c_value = None
df2
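The same stateful scan can be sketched more compactly with a dictionary instead of one variable per name. This is only a sketch under two assumptions: the rows are already sorted by Timestamp (so the row that completes a set carries the latest timestamp), and `sum_complete_sets` is a hypothetical helper name. The sample data is built inline so the snippet runs without a CSV file:

```python
import pandas as pd

def sum_complete_sets(df, names=("a", "b", "c")):
    """Scan rows in order and emit one row each time every name has been seen."""
    latest = {}  # most recent COST per name since the last reset
    rows = []
    for name, cost, ts in zip(df["Name"], df["COST"], df["Timestamp"]):
        if name not in names:
            continue
        latest[name] = cost  # a repeated name overwrites its earlier entry
        if len(latest) == len(names):
            rows.append({"sum_cost": sum(latest.values()), "Timestamp": ts})
            latest = {}      # start collecting the next set
    return pd.DataFrame(rows, columns=["sum_cost", "Timestamp"])

df = pd.DataFrame({
    "Name": list("cacbcabacc"),
    "COST": [8, 8, 9, 10, 4, 9, 3, 3, 6, 6],
    "Timestamp": pd.to_datetime([
        "2023-09-20 15:14:46", "2023-09-20 15:14:48",
        "2023-09-20 15:14:55", "2023-09-20 15:15:00",
        "2023-09-20 15:15:02", "2023-09-20 15:15:04",
        "2023-09-20 15:15:12", "2023-09-20 15:15:17",
        "2023-09-20 15:15:20", "2023-09-20 15:15:29",
    ]),
})
result = sum_complete_sets(df)
print(result)
```

Like the loop above, this does not require the three names to sit on adjacent rows; it only waits until each name has appeared at least once since the last reset.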
Upvotes: 1