Dimitri
Dimitri

Reputation: 117

Iteration between rows and columsn of a DataFrame to calculate the mean

I have a dataframe which reads:

       A  2007/Ago  2007/Set  2007/Out ... 2020/Jan 2020/Fev
row1   x   number     number    number ...   number   number
row2   y   number     number    number ...   number   number
row3   w   number     number    number ...   number   number
...      
row27  z   number     number    number ...   number   number

I mean, there are numbers in each cell. I want to calculate the mean of the cells for which the columns starts with 2007, and then calculate the mean of the cells of which the columns starts with 2008, and then 2009, ..., and then 2020 and do this for each row.

What I tried to sketch is something like:

x = []

for i in df.row(i):             #that is, for each row of the dataframe
    if column.startswith('j'):  #which starts with j=2008, 2009, 2010 etc
        x += df[i][j]         #the variable x gets the number on that row i,column j and sum

What I want in the end are various columsn with the results of the mean for each year, that is, I want

result1       result2          result3     ...     resultn
mean colums   mean colums    mean columsn        mean columsn
 starts          starts        starts               starts
 with 2008      with 2009      with 2010          with 2020

That is, I want 13 new columns: one for each mean (years from 2008 to 2020).

I can't continue this loop and I do not know how much basic this is, but my questions are:

1- Are there any more optimal way of doing this? I mean, using pandas functions other than loops?

In my dataframe, each cell corresponds to the total cost of health expends in that month, and I want to take the mean of the cost of the entire year to compare it to the population of each city (which are thw rows). I am struggling with this for some time and I am not able to solve it. My level using pandas is very basic.

PS: sorry for the dataframe representation, I do not know how to properly write one in the stackoverflow's body question.

Upvotes: 1

Views: 86

Answers (4)

Valentin Kuhn
Valentin Kuhn

Reputation: 856

You could iterate over the years, select the subset of columns and just use pandas' mean() function to get the mean of that year:

means = {}
for year in range(2007, 2021):
    # assuming df is your dataframe
    sub_df = df.loc[:, df.columns.str.startswith(str(year))]
    # first mean() aggregates per column, second mean() aggregates the whoöe year
    means[year] = sub_df.mean().mean()

This yields a dict with the years as key and the mean for that year as value. If there are no columns for one year, means[year] contains NaN.

Upvotes: 1

Valentin Kuhn
Valentin Kuhn

Reputation: 856

While re-working my other answer, I found this one-liner:

df.mean().groupby(lambda x: x[:4]).mean()

Explanation

Pandas' `mean' function calculates the mean per column:

# using the DataFrame from Henry's answer:
df = pd.DataFrame({
    'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
    '2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
    '2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
    '2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
    '2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
    '2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})

# calculate mean per column
col_means = df.mean()
# 2007/Ago     2.5
# 2007/Set     6.5
# 2007/Out    10.5
# 2020/Jan    14.5
# 2020/Fev    18.5
# dtype: float64

# group above columns by first 4 characters, i.e., the year
year_groups = col_means.groupby(lambda x: x[:4])

# calculate the mean per year group
year_groups.mean()
# 2007     6.5
# 2020    16.5
# dtype: float64

Upvotes: 1

Henry Ecker
Henry Ecker

Reputation: 35686

An option via melt + pivot_table with the aggfunc set to mean:

import pandas as pd

df = pd.DataFrame({
    'A': {'row1': 'x', 'row2': 'y', 'row3': 'w', 'row27': 'z'},
    '2007/Ago': {'row1': 1, 'row2': 2, 'row3': 3, 'row27': 4},
    '2007/Set': {'row1': 5, 'row2': 6, 'row3': 7, 'row27': 8},
    '2007/Out': {'row1': 9, 'row2': 10, 'row3': 11, 'row27': 12},
    '2020/Jan': {'row1': 13, 'row2': 14, 'row3': 15, 'row27': 16},
    '2020/Fev': {'row1': 17, 'row2': 18, 'row3': 19, 'row27': 20}
})

df = df.melt(id_vars='A', var_name='year')
# Rename month columns to their year value
df['year'] = df['year'].str.split('/').str[0]

# pivot to wide format based on the new year value
df = (
    df.pivot_table(columns='year', index='A', aggfunc='mean')
    .droplevel(0, 1)
    .rename_axis(None)
    .rename_axis(None, axis=1)
)
print(df)

df:

   2007  2020
w     7    17
x     5    15
y     6    16
z     8    18

Upvotes: 2

Andrej Kesely
Andrej Kesely

Reputation: 195553

Suppose you have this dataframe:

       A  2007/Ago  2007/Set  2007/Out  2020/Jan  2020/Fev
row1   x         1         5         9        13        17
row2   y         2         6        10        14        18
row3   w         3         7        11        15        19
row27  z         4         8        12        16        20

You can use .filter() and .mean(axis=1) to compute the values:

df["result"] = df.filter(regex=r"^\d{4}").mean(axis=1)
print(df)

Prints:

       A  2007/Ago  2007/Set  2007/Out  2020/Jan  2020/Fev  result
row1   x         1         5         9        13        17     9.0
row2   y         2         6        10        14        18    10.0
row3   w         3         7        11        15        19    11.0
row27  z         4         8        12        16        20    12.0

Upvotes: 1

Related Questions