Create a function can add column to the end of Pandas DataFrame containing average of previous data

Question

I have a DataFrame "Data" that contains the following:

Data (example):

State	2008-02-29	2008-03-31	2009-01-31	2009-02-28	2009-04-30
NY	7437.0	7930.0	9457.0	10037.0	12425.0
CA	3265.0	3855.0	5215.0	4163.0	5275.0

I want to add columns to this dataframe, containing the average of the values under column 2008-02-29 and 2008-03-31 for each row. Also, add column containing the average of the values under column 2009-01-31, 2009-02-28 and 2009-04-30 for each row.

The Data DataFrame might change size as my code reads from different CSV files later, so the function should be reusable regardless of adding or deleting some columns (i.e dataframe might contain data in 2010, 2011, 2021 and so on).

desired output:

State	2008-02-29	2008-03-31	2009-01-31	2009-02-28	2009-04-30	2008_avg	2009_avg
NY	7437.0	7930.0	9457.0	10037.0	12425.0	7683.5	10639.67
CA	3265.0	3855.0	5215.0	4163.0	5275.0	3560	4884.33

I tried the following code, but datafram has too many columns. I hope to find a more automated and efficient method.

Data['2008_avg'] = Data[['2008-02-29', '2008-03-31']].mean(axis=1)
Data['2009_avg'] = Data[['2009-01-31', '2009-02-28', '2009-04-30']].mean(axis=1)

marwan · Accepted Answer

my advise is for you to transpose your dataframe so that the State is a column, and the dates are the index. Then you can rely on pandas resample method for aggregating based on a desired frequency which in your case seems to be an annual frequency.

i.e. see the sample code below

In [29]: df = pd.DataFrame({'State': ['NY', 'CA'],
    ...:         '2008-02-29': [7437.0, 3265.0],
    ...:         '2008-03-31': [7930.0, 3855.0],
    ...:         '2009-01-31': [9457.0, 5215.0],
    ...:         '2009-02-28': [10037.0, 4163.0],
    ...:         '2009-04-30': [12425.0, 5275.0]}
    ...: )

In [30]: df
Out[30]: 
  State  2008-02-29  2008-03-31  2009-01-31  2009-02-28  2009-04-30
0    NY      7437.0      7930.0      9457.0     10037.0     12425.0
1    CA      3265.0      3855.0      5215.0      4163.0      5275.0

In [31]: out = df.set_index("State").T

In [32]: out
Out[32]: 
State            NY      CA
2008-02-29   7437.0  3265.0
2008-03-31   7930.0  3855.0
2009-01-31   9457.0  5215.0
2009-02-28  10037.0  4163.0
2009-04-30  12425.0  5275.0

In [33]: out.index = out.index.astype("datetime64[ns]")

In [34]: out = out.resample("A").mean()
Out[34]: 
State                 NY           CA
2008-12-31   7683.500000  3560.000000
2009-12-31  10639.666667  4884.333333

In [35]: out.T
Out[35]: 
       2008-12-31    2009-12-31
State                          
NY         7683.5  10639.666667
CA         3560.0   4884.333333

Create a function can add column to the end of Pandas DataFrame containing average of previous data

Answers (2)

Related Questions