Mine

Reputation: 861

Take sum of values before the row's date

I have a dataframe that looks like this:


import pandas as pd

df = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
 'date': {0: '11/11/2018',
  1: '11/12/2018',
  2: '11/13/2018',
  3: '11/14/2018',
  4: '11/15/2018',
  5: '11/16/2018'},
 'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5}})

I need the resulting dataframe to look like this:

output = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
 'date': {0: '11/11/2018',
  1: '11/12/2018',
  2: '11/13/2018',
  3: '11/14/2018',
  4: '11/15/2018',
  5: '11/16/2018'},
 'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5},
 'total_score_per_id_before_date': {0: 1, 1: 1, 2: 3, 3: 3, 4: 1, 5: 1}})

My code so far:

output = df[["id", "score"]].groupby("id").sum()

However, this gives me the total sum of scores for each id. I need the sum of the scores for that id that come before the date in each specific row. The one exception is the first row for each id, which should keep its own score rather than being zero.

Upvotes: 0

Views: 555

Answers (1)

Oliver W.

Reputation: 13459

Take the cumulative sum of the score column within each id. Then subtract the current row's score, since you asked for the sum before the current row. Finally, add back each id's first score, which would otherwise come out as zero.

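# Per-id running total minus the current row gives the sum of earlier scores.
# Selecting "score" first keeps the non-numeric date column out of cumsum.
previously_accumulated_scores = df.groupby("id")["score"].cumsum() - df["score"]

# First row of each id; after the left merge, score_r is set only on those rows.
firsts = df.groupby("id").first().reset_index()
df2 = df.merge(firsts, on=["id", "date"], how="left", suffixes=("", "_r"))

df["total_score_per_id_before_date"] = previously_accumulated_scores + df2.score_r.fillna(0)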

The merge could be done more elegantly by changing the index to a MultiIndex, but that’s a style preference.
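
For comparison, here is a minimal sketch of one way to skip the merge altogether. It is not the MultiIndex variant mentioned above; it swaps in groupby(...).cumcount() to flag the first row of each id, and the names before and is_first are just illustrative. It assumes the rows are already in date order.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 3, 2, 2, 1, 3],
    "date": ["11/11/2018", "11/12/2018", "11/13/2018",
             "11/14/2018", "11/15/2018", "11/16/2018"],
    "score": [1, 1, 3, 2, 0, 5],
})

# Sum of earlier scores per id: running total minus the current row.
before = df.groupby("id")["score"].cumsum() - df["score"]

# cumcount() numbers the rows within each id starting at 0,
# so 0 marks the first occurrence of that id.
is_first = df.groupby("id").cumcount() == 0

# Keep `before` everywhere except on first rows, which keep their own score.
df["total_score_per_id_before_date"] = before.where(~is_first, df["score"])
```

Because no NaN values are introduced along the way, the new column stays an integer column here, whereas the merge-and-fillna route produces floats.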

Note: this assumes your DataFrame is already sorted by the date-like column. According to the pandas docs, groupby preserves the order of rows within each group.
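
If the rows are not guaranteed to be in date order, a sketch along these lines could be run first. The sample data below is made up and deliberately shuffled, and it assumes the dates are month/day/year strings as in the question.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample data.
df = pd.DataFrame({
    "id": [1, 2, 1, 2],
    "date": ["11/15/2018", "11/14/2018", "11/11/2018", "11/13/2018"],
    "score": [0, 2, 1, 3],
})

# Parse the strings into real timestamps so sorting is chronological,
# not alphabetical (assumed format: month/day/year).
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Sort by date and reset to a default 0..n-1 index, so later index-aligned
# arithmetic (for example against a merged frame) lines up row by row.
df = df.sort_values("date", kind="stable").reset_index(drop=True)

# Per-id running total, now accumulated in date order.
running = df.groupby("id")["score"].cumsum()
```

Resetting the index matters for the merge-based approach, since the merged frame gets a fresh default index and the final addition aligns on index labels.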

Upvotes: 2
