question.it
question.it

Reputation: 2968

How to get average for each date by considering current date and previous all date data

I have below dataframe and I want to calcultae cumulative average for each date by considering current date and previous all date data.

df = pd.DataFrame({'Items':['Item1', 'Item2', 'Item1', 'Item2', 'Item1', 'Item2', 'Item1', 'Item2', 'Item1', 'Item2', 'Item1'],
         'Variable': ['V1', 'V2', 'V1', 'V2', 'V1', 'V2', 'V1', 'V2', 'V1', 'V2', 'V1'],
         'Date': ['2020-12-16', '2020-12-16', '2020-12-16', '2020-12-16', '2020-12-17', '2020-12-17', '2020-12-17', '2020-12-17', '2020-12-18', '2020-12-18', '2020-12-18'],
         'Value': [5, 2, 5, 1, 1, 1, 1, 2, 1, 1, 1]})

df = df.sort_values(['Date'], ascending=[True])

but below script does not help:

df.groupby(['Items', 'Variable', 'Date'])['Value'].expanding().mean().reset_index(name='Value')

I need result like below: enter image description here

In MS excel we can find average for latest date -2018 by selecting all previous rows like below:

enter image description here

like above I want to calculate for all dates

Upvotes: 1

Views: 73

Answers (2)

Pygirl
Pygirl

Reputation: 13349

Try:

df = df.sort_values(['Items', 'Date', 'Variable'], ascending=[True, True, True])

x = df.reset_index().groupby(['Items', 'Variable'])['Value']
index = x.cumcount()+1
df['Value'] = x.cumsum()/(index.values)

df1 = df[np.where(df[['Items', 'Variable', 'Date']].duplicated(keep='last'), False, True)].reset_index(drop=True)

df1:

    Items   Variable    Date    Value
0   Item1   V1       2020-12-16 5.000000
1   Item1   V1       2020-12-17 3.000000
2   Item1   V1       2020-12-18 2.333333
3   Item2   V2       2020-12-16 1.500000
4   Item2   V2       2020-12-17 1.500000
5   Item2   V2       2020-12-18 1.400000

Edit:

Instead of df[np.where(df[['Items', 'Variable', 'Date']].duplicated(keep='last'), False, True)].reset_index(drop=True)

use df.drop_duplicates(subset=['Items', 'Variable', 'Date'], keep='last').reset_index(drop=True)

Upvotes: 4

From what I can see from you excel sheet, this is what you want to do:

df = df.sort_values(['Variable','Date'], ascending=[True,True])
df['cummean'] = df.groupby(['Variable'])['Value'].transform(lambda x: x.rolling(6,1).mean())

which return

    Items Variable        Date  Value   cummean
0   Item1       V1  2020-12-16      5  5.000000
2   Item1       V1  2020-12-16      1  3.000000
4   Item1       V1  2020-12-17      1  2.333333
6   Item1       V1  2020-12-17      1  2.000000
8   Item1       V1  2020-12-18      1  1.800000
10  Item1       V1  2020-12-18      1  1.666667
1   Item2       V2  2020-12-16      5  5.000000
3   Item2       V2  2020-12-16      1  3.000000
5   Item2       V2  2020-12-17      1  2.333333
7   Item2       V2  2020-12-17      2  2.250000
9   Item2       V2  2020-12-18      1  2.000000

Upvotes: 1

Related Questions