Reputation: 3
I have a DataFrame "Data" that contains the following:
Data (example):
State | 2008-02-29 | 2008-03-31 | 2009-01-31 | 2009-02-28 | 2009-04-30 |
---|---|---|---|---|---|
NY | 7437.0 | 7930.0 | 9457.0 | 10037.0 | 12425.0 |
CA | 3265.0 | 3855.0 | 5215.0 | 4163.0 | 5275.0 |
I want to add columns to this dataframe, containing the average of the values under column 2008-02-29 and 2008-03-31 for each row. Also, add column containing the average of the values under column 2009-01-31, 2009-02-28 and 2009-04-30 for each row.
The Data DataFrame might change size as my code reads from different CSV files later, so the function should be reusable regardless of adding or deleting some columns (i.e dataframe might contain data in 2010, 2011, 2021 and so on).
desired output:
State | 2008-02-29 | 2008-03-31 | 2009-01-31 | 2009-02-28 | 2009-04-30 | 2008_avg | 2009_avg |
---|---|---|---|---|---|---|---|
NY | 7437.0 | 7930.0 | 9457.0 | 10037.0 | 12425.0 | 7683.5 | 10639.67 |
CA | 3265.0 | 3855.0 | 5215.0 | 4163.0 | 5275.0 | 3560 | 4884.33 |
I tried the following code, but datafram has too many columns. I hope to find a more automated and efficient method.
Data['2008_avg'] = Data[['2008-02-29', '2008-03-31']].mean(axis=1)
Data['2009_avg'] = Data[['2009-01-31', '2009-02-28', '2009-04-30']].mean(axis=1)
Upvotes: 0
Views: 90
Reputation: 494
my advise is for you to transpose your dataframe so that the State is a column, and the dates are the index. Then you can rely on pandas resample method for aggregating based on a desired frequency which in your case seems to be an annual frequency.
i.e. see the sample code below
In [29]: df = pd.DataFrame({'State': ['NY', 'CA'],
...: '2008-02-29': [7437.0, 3265.0],
...: '2008-03-31': [7930.0, 3855.0],
...: '2009-01-31': [9457.0, 5215.0],
...: '2009-02-28': [10037.0, 4163.0],
...: '2009-04-30': [12425.0, 5275.0]}
...: )
In [30]: df
Out[30]:
State 2008-02-29 2008-03-31 2009-01-31 2009-02-28 2009-04-30
0 NY 7437.0 7930.0 9457.0 10037.0 12425.0
1 CA 3265.0 3855.0 5215.0 4163.0 5275.0
In [31]: out = df.set_index("State").T
In [32]: out
Out[32]:
State NY CA
2008-02-29 7437.0 3265.0
2008-03-31 7930.0 3855.0
2009-01-31 9457.0 5215.0
2009-02-28 10037.0 4163.0
2009-04-30 12425.0 5275.0
In [33]: out.index = out.index.astype("datetime64[ns]")
In [34]: out = out.resample("A").mean()
Out[34]:
State NY CA
2008-12-31 7683.500000 3560.000000
2009-12-31 10639.666667 4884.333333
In [35]: out.T
Out[35]:
2008-12-31 2009-12-31
State
NY 7683.5 10639.666667
CA 3560.0 4884.333333
Upvotes: 1
Reputation: 11504
I think this is what you are looking for:
import re
import pandas as pd
data = {'State': ['NY', 'CA'],
'2008-02-29': [7437.0, 3265.0],
'2008-03-31': [7930.0, 3855.0],
'2009-01-31': [9457.0, 5215.0],
'2009-02-28': [10037.0, 4163.0],
'2009-04-30': [12425.0, 5275.0]}
df = pd.DataFrame(data)
def add_avg_columns(df):
year_2008_columns = filter(re.compile("^2008").search, df.columns)
df['2008_avg'] = df[list(year_2008_columns)].mean(axis=1)
year_2009_columns = filter(re.compile("^2009").search, df.columns)
df['2009_avg'] = df[list(year_2009_columns)].mean(axis=1)
return df
Data = add_avg_columns(df)
print(Data)
which gives
State 2008-02-29 2008-03-31 2009-01-31 2009-02-28 2009-04-30 2008_avg \
0 NY 7437.0 7930.0 9457.0 10037.0 12425.0 7683.5
1 CA 3265.0 3855.0 5215.0 4163.0 5275.0 3560.0
2009_avg
0 10639.666667
1 4884.333333
Upvotes: 0