Reputation: 117
I have a dataframe which reads:
A 2007/Ago 2007/Set 2007/Out ... 2020/Jan 2020/Fev
row1 x number number number ... number number
row2 y number number number ... number number
row3 w number number number ... number number
...
row27 z number number number ... number number
I mean, there are numbers in each cell. I want to calculate the mean of the cells for which the columns starts with 2007, and then calculate the mean of the cells of which the columns starts with 2008, and then 2009, ..., and then 2020 and do this for each row.
What I tried to sketch is something like:
x = []
for i in df.row(i): #that is, for each row of the dataframe
if column.startswith('j'): #which starts with j=2008, 2009, 2010 etc
x += df[i][j] #the variable x gets the number on that row i,column j and sum
What I want in the end are various columsn with the results of the mean for each year, that is, I want
result1 result2 result3 ... resultn
mean colums mean colums mean columsn mean columsn
starts starts starts starts
with 2008 with 2009 with 2010 with 2020
That is, I want 13 new columns: one for each mean (years from 2008 to 2020).
I can't continue this loop and I do not know how much basic this is, but my questions are:
1- Are there any more optimal way of doing this? I mean, using pandas functions other than loops?
In my dataframe, each cell corresponds to the total cost of health expends in that month, and I want to take the mean of the cost of the entire year to compare it to the population of each city (which are thw rows). I am struggling with this for some time and I am not able to solve it. My level using pandas is very basic.
PS: sorry for the dataframe representation, I do not know how to properly write one in the stackoverflow's body question.
Upvotes: 1
Views: 86
Reputation: 856
You could iterate over the years, select the subset of columns and just use pandas' mean()
function to get the mean of that year:
means = {}
for year in range(2007, 2021):
# assuming df is your dataframe
sub_df = df.loc[:, df.columns.str.startswith(str(year))]
# first mean() aggregates per column, second mean() aggregates the whoöe year
means[year] = sub_df.mean().mean()
This yields a dict
with the years as key and the mean for that year as value. If there are no columns for one year, means[year] contains NaN.
Upvotes: 1
Reputation: 856
While re-working my other answer, I found this one-liner:
df.mean().groupby(lambda x: x[:4]).mean()
Pandas' `mean' function calculates the mean per column:
# using the DataFrame from Henry's answer:
df = pd.DataFrame({
'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
'2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
'2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
'2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
'2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
'2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})
# calculate mean per column
col_means = df.mean()
# 2007/Ago 2.5
# 2007/Set 6.5
# 2007/Out 10.5
# 2020/Jan 14.5
# 2020/Fev 18.5
# dtype: float64
# group above columns by first 4 characters, i.e., the year
year_groups = col_means.groupby(lambda x: x[:4])
# calculate the mean per year group
year_groups.mean()
# 2007 6.5
# 2020 16.5
# dtype: float64
Upvotes: 1
Reputation: 35686
An option via melt
+ pivot_table
with the aggfunc set to mean:
import pandas as pd
df = pd.DataFrame({
'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
'2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
'2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
'2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
'2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
'2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})
df = df.melt(id_vars='A', var_name='year')
# Rename month columns to their year value
df['year'] = df['year'].str.split('/').str[0]
# pivot to wide format based on the new year value
df = (
df.pivot_table(columns='year', index='A', aggfunc='mean')
.droplevel(0, 1)
.rename_axis(None)
.rename_axis(None, axis=1)
)
print(df)
df
:
2007 2020
w 7 17
x 5 15
y 6 16
z 8 18
Upvotes: 2
Reputation: 195553
Suppose you have this dataframe:
A 2007/Ago 2007/Set 2007/Out 2020/Jan 2020/Fev
row1 x 1 5 9 13 17
row2 y 2 6 10 14 18
row3 w 3 7 11 15 19
row27 z 4 8 12 16 20
You can use .filter()
and .mean(axis=1)
to compute the values:
df["result"] = df.filter(regex=r"^\d{4}").mean(axis=1)
print(df)
Prints:
A 2007/Ago 2007/Set 2007/Out 2020/Jan 2020/Fev result
row1 x 1 5 9 13 17 9.0
row2 y 2 6 10 14 18 10.0
row3 w 3 7 11 15 19 11.0
row27 z 4 8 12 16 20 12.0
Upvotes: 1