Belinda Zhang
Belinda Zhang

Reputation: 3

Create a function can add column to the end of Pandas DataFrame containing average of previous data

I have a DataFrame "Data" that contains the following:

Data (example):

State 2008-02-29 2008-03-31 2009-01-31 2009-02-28 2009-04-30
NY 7437.0 7930.0 9457.0 10037.0 12425.0
CA 3265.0 3855.0 5215.0 4163.0 5275.0

I want to add columns to this dataframe, containing the average of the values under column 2008-02-29 and 2008-03-31 for each row. Also, add column containing the average of the values under column 2009-01-31, 2009-02-28 and 2009-04-30 for each row.

The Data DataFrame might change size as my code reads from different CSV files later, so the function should be reusable regardless of adding or deleting some columns (i.e dataframe might contain data in 2010, 2011, 2021 and so on).

desired output:

State 2008-02-29 2008-03-31 2009-01-31 2009-02-28 2009-04-30 2008_avg 2009_avg
NY 7437.0 7930.0 9457.0 10037.0 12425.0 7683.5 10639.67
CA 3265.0 3855.0 5215.0 4163.0 5275.0 3560 4884.33

I tried the following code, but datafram has too many columns. I hope to find a more automated and efficient method.

Data['2008_avg'] = Data[['2008-02-29', '2008-03-31']].mean(axis=1)
Data['2009_avg'] = Data[['2009-01-31', '2009-02-28', '2009-04-30']].mean(axis=1)

Upvotes: 0

Views: 90

Answers (2)

marwan
marwan

Reputation: 494

my advise is for you to transpose your dataframe so that the State is a column, and the dates are the index. Then you can rely on pandas resample method for aggregating based on a desired frequency which in your case seems to be an annual frequency.

i.e. see the sample code below

In [29]: df = pd.DataFrame({'State': ['NY', 'CA'],
    ...:         '2008-02-29': [7437.0, 3265.0],
    ...:         '2008-03-31': [7930.0, 3855.0],
    ...:         '2009-01-31': [9457.0, 5215.0],
    ...:         '2009-02-28': [10037.0, 4163.0],
    ...:         '2009-04-30': [12425.0, 5275.0]}
    ...: )

In [30]: df
Out[30]: 
  State  2008-02-29  2008-03-31  2009-01-31  2009-02-28  2009-04-30
0    NY      7437.0      7930.0      9457.0     10037.0     12425.0
1    CA      3265.0      3855.0      5215.0      4163.0      5275.0

In [31]: out = df.set_index("State").T

In [32]: out
Out[32]: 
State            NY      CA
2008-02-29   7437.0  3265.0
2008-03-31   7930.0  3855.0
2009-01-31   9457.0  5215.0
2009-02-28  10037.0  4163.0
2009-04-30  12425.0  5275.0

In [33]: out.index = out.index.astype("datetime64[ns]")

In [34]: out = out.resample("A").mean()
Out[34]: 
State                 NY           CA
2008-12-31   7683.500000  3560.000000
2009-12-31  10639.666667  4884.333333

In [35]: out.T
Out[35]: 
       2008-12-31    2009-12-31
State                          
NY         7683.5  10639.666667
CA         3560.0   4884.333333

Upvotes: 1

I think this is what you are looking for:

import re
import pandas as pd

data = {'State': ['NY', 'CA'],
        '2008-02-29': [7437.0, 3265.0],
        '2008-03-31': [7930.0, 3855.0],
        '2009-01-31': [9457.0, 5215.0],
        '2009-02-28': [10037.0, 4163.0],
        '2009-04-30': [12425.0, 5275.0]}

df = pd.DataFrame(data)


def add_avg_columns(df):
    year_2008_columns = filter(re.compile("^2008").search, df.columns)
    df['2008_avg'] = df[list(year_2008_columns)].mean(axis=1)
    
    year_2009_columns = filter(re.compile("^2009").search, df.columns)
    df['2009_avg'] = df[list(year_2009_columns)].mean(axis=1)
    
    return df

Data = add_avg_columns(df)
print(Data)

which gives

 State  2008-02-29  2008-03-31  2009-01-31  2009-02-28  2009-04-30  2008_avg  \
0    NY      7437.0      7930.0      9457.0     10037.0     12425.0    7683.5   
1    CA      3265.0      3855.0      5215.0      4163.0      5275.0    3560.0   

       2009_avg  
0  10639.666667  
1   4884.333333  

Upvotes: 0

Related Questions