EChan
EChan

Reputation: 105

Calculate mean of each column of pandas dataframe based on condition (i.e. averageifs)

I don't know why I'm struggling so hard with this one. I'm trying to do the excel equivalent of an averageifs calc across a pandas dataframe.

I have the following:

df = pd.DataFrame(rng.rand(1000, 7), columns=['1/31/2019', '2/28/2019', '3/31/2019', '4/30/2019', '5/31/2019', '6/30/2019', '7/31/2019'])

I also have a column:

df['Doc_Number'] = ['A', 'B', 'C', 'B', 'C', 'B', 'A', 'A', 'D', 'G', 'G', 'D', 'G', 'B' ...]

I want to do the excel equivalent of averageifs on the Doc_Number on each column of the df while maintaining the structure of the dataframe. So in each column, I'd calc the mean if df['Doc_Number'] = ['A', 'B', 'C'...] but I'd still maintain the 1,000 rows and I'd apply the calc to each individual column ['1/31/2019', '2/28/2019', '3/31/2019' ...].

For a single column, I would do something like:

df['AverageIfs'] = df.groupby('Doc_Number')['1/31/2019'].transform('np.mean')

But how would you apply the calc to each column of the df? In reality, I have many more columns to apply the calc across.

I'm a complete amateur so thanks for putting up with my questions.

Upvotes: 1

Views: 118

Answers (1)

jezrael
jezrael

Reputation: 862851

You can remove ['1/31/2019'] after groupby for process all columns to new DataFramme, change columns names with add_suffix and add to original by join:

#simplify df for easy check output
np.random.seed(123)
df = pd.DataFrame(np.random.rand(14, 2), columns=['1/31/2019', '2/28/2019'])
df['Doc_Number'] = ['A', 'B', 'C', 'B', 'C', 'B', 'A', 'A', 'D', 'G', 'G', 'D', 'G', 'B']

print (df)
    1/31/2019  2/28/2019 Doc_Number
0    0.696469   0.286139          A
1    0.226851   0.551315          B
2    0.719469   0.423106          C
3    0.980764   0.684830          B
4    0.480932   0.392118          C
5    0.343178   0.729050          B
6    0.438572   0.059678          A
7    0.398044   0.737995          A
8    0.182492   0.175452          D
9    0.531551   0.531828          G
10   0.634401   0.849432          G
11   0.724455   0.611024          D
12   0.722443   0.322959          G
13   0.361789   0.228263          B

df = df.join(df.groupby('Doc_Number').transform('mean').add_suffix('_mean'))
print (df)
    1/31/2019  2/28/2019 Doc_Number  1/31/2019_mean  2/28/2019_mean
0    0.696469   0.286139          A        0.511029        0.361271
1    0.226851   0.551315          B        0.478146        0.548364
2    0.719469   0.423106          C        0.600200        0.407612
3    0.980764   0.684830          B        0.478146        0.548364
4    0.480932   0.392118          C        0.600200        0.407612
5    0.343178   0.729050          B        0.478146        0.548364
6    0.438572   0.059678          A        0.511029        0.361271
7    0.398044   0.737995          A        0.511029        0.361271
8    0.182492   0.175452          D        0.453474        0.393238
9    0.531551   0.531828          G        0.629465        0.568073
10   0.634401   0.849432          G        0.629465        0.568073
11   0.724455   0.611024          D        0.453474        0.393238
12   0.722443   0.322959          G        0.629465        0.568073
13   0.361789   0.228263          B        0.478146        0.548364

Upvotes: 2

Related Questions