Reputation: 23099
I have a sort of inelegant solution for the following problem and I was wondering if there was a better way to handle the following sort of dataset,
I have a small data set of around 150k rows which has 3 essential rows I'm trying to wrangle :
ID Status Week
1 base 1
1 over 2
1 base 3
1 over 4
1 over 5
1 under 6
What would be the most efficient way of counting the final occurrences of over
?
essentially, for the data-set above I would say ID == 1
was last at the Status
over
starting at week 4
for a total of two weeks, 4
and 5
.
My method at the moment is to sort my df
by ID
and Week
, then groupby.tail()
to get the last occurrences of the over and under and sum the differences of the weeks.
I feel a bit silly as I'm sure this is a simple problem..!
target dataset
ID Duration of Last-Over, Start-Week, End-Week
1, 2, 4, 5
original DataFrame :
ID, Budget, Spend, Week, Status
1, 50, 50, 1, base
1 50, 55, 2, over
1 50, 50, 3, base
1 50, 250, 4, over
1 50, 300, 5, over
1 50, 42, 6, under.
Upvotes: 1
Views: 52
Reputation: 2019
This code below does that in two steps:
last = df[df.Status == 'over'][::-1][0:1].index
before_last = df[df.Status == 'over'][::-1][1:2].index
diff = df.iloc[last,3].values - df.iloc[before_last,3].values +1
ID = df.iloc[last,0]
new_df = pd.DataFrame({'ID': {0: ID.values[0]},
'Duration of Last-Over': {0: diff[0]},
'Start-Week': {0: df.iloc[before_last]['Week'].values[0]},
'End-Week': {0: df.iloc[last]['Week'].values[0]}})
Producing this output:
EDIT: maybe a more simple answer:
Getting just the last two occurrences of 'over' and manipulate it.
df_aux = df[df.Status == 'over'].tail(2)
new_df = pd.DataFrame({'ID': {0: df_aux.iloc[1,0]},
'Duration of Last-Over': {0: df_aux.iloc[1,3] - df_aux.iloc[0,3] +1},
'Start-Week': {0: df_aux.iloc[0,3]},
'End-Week': {0: df_aux.iloc[1,3]}})
Upvotes: 1