deanpwr

Reputation: 191

Replace column in Pandas dataframe with the mean of that column

I have a dataframe:

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

   A  B
0  1  2
1  1  3
2  4  6

I want to return a dataframe of the same size containing the mean of each column:

   A      B
0  2  3.666
1  2  3.666
2  2  3.666

Is there a simple way of doing this?

Upvotes: 3

Views: 2010

Answers (3)

yatu

Reputation: 88236

Here's one with assign:

df.assign(**df.mean())

    A         B
0  2.0  3.666667
1  2.0  3.666667
2  2.0  3.666667

Details

The mean is easily obtained with DataFrame.mean:

df.mean()

A    2.000000
B    3.666667
dtype: float64

From the above Series, we can use dictionary unpacking to replace the existing columns with the resulting values. Note that we can unpack the Series into a dictionary using **:

{**df.mean()}
# {'A': 2.0, 'B': 3.6666666666666665}

assign takes new or replacement columns as keyword arguments, as in df.assign(a_given_column=a_value, another_column=some_other_value), so unpacking the Series turns each column name into a keyword and its mean into the value. This requires the column names to be strings. Because each value is a scalar, it is repeated along the original dataframe's index, so df.assign(**df.mean()) replaces every value in a column with that column's mean.
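
As a minimal sketch of the unpacking step, assuming the question's two-column frame with string column names, the unpacked call can be compared against one that spells the keywords out by hand (the names `means`, `unpacked` and `explicit` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

means = df.mean()  # Series indexed by column name

# Unpacking the Series passes each column name as a keyword argument,
# so these two calls should build the same frame.
unpacked = df.assign(**means)
explicit = df.assign(A=means['A'], B=means['B'])

print(unpacked.equals(explicit))
```

If the column names were not strings (for example integers), the `**` unpacking would fail, and one of the constructor-based approaches would be the safer choice.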

Upvotes: 2

ALollz

Reputation: 59529

Recreate the DataFrame: convert the mean Series to a dict, and let the index argument define the number of rows.

pd.DataFrame(df.mean().to_dict(), index=df.index)

#     A         B
#0  2.0  3.666667
#1  2.0  3.666667
#2  2.0  3.666667
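
To illustrate why the index is needed, here is a small sketch with hypothetical scalar values (not the question's data): a dict of scalars alone gives pandas no row count, while passing an index repeats each scalar for every row.

```python
import pandas as pd

means = {'A': 2.0, 'B': 3.5}  # hypothetical scalar values for illustration

# With only scalars and no index, pandas cannot infer a row count
# and raises ValueError.
try:
    pd.DataFrame(means)
    raised = False
except ValueError:
    raised = True

# Supplying an index fixes the number of rows; each scalar is repeated.
filled = pd.DataFrame(means, index=range(3))
print(raised, filled.shape)
```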

Same concept, but creating the full array first saves a decent amount of time.

pd.DataFrame(np.broadcast_to(df.mean(), df.shape), 
             index=df.index, 
             columns=df.columns)
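
A rough sketch of what the broadcast does, using the question's frame; the explicit `.to_numpy()` call and the names `tiled` and `result` are additions for illustration. `np.broadcast_to` returns a read-only view that repeats the length-2 mean vector along the rows instead of copying it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# df.mean() has shape (2,); broadcasting repeats it along the rows
# to match df.shape == (3, 2) without copying the data.
tiled = np.broadcast_to(df.mean().to_numpy(), df.shape)

# Wrap the full array, reusing the original labels.
result = pd.DataFrame(tiled, index=df.index, columns=df.columns)
print(tiled.shape)
```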

Here are some timings. They depend somewhat on the number of columns, but the differences are large when you provide the entire array up front.

import perfplot
import pandas as pd
import numpy as np

perfplot.show(
    setup=lambda N: pd.DataFrame(np.random.randint(1,100, (N, 5)),
                                 columns=[str(x) for x in range(5)]), 
    kernels=[
        lambda df: pd.DataFrame(np.broadcast_to(df.mean(), df.shape), index=df.index, columns=df.columns),
        lambda df: df.assign(**df.mean()),
        lambda df: pd.DataFrame(df.mean().to_dict(), index=df.index)
    ],
    labels=['numpy broadcast', 'assign', 'dict'],
    n_range=[2 ** k for k in range(1, 22)],
    equality_check=np.allclose,
    xlabel="Len(df)"
)

[Figure: perfplot timings for 'numpy broadcast', 'assign' and 'dict' against Len(df)]

Upvotes: 1

Serge Ballesta

Reputation: 148890

You only need to provide a single row at DataFrame creation time:

pd.DataFrame(data = [df.mean()], index = df.index)

It gives:

     A         B
0  2.0  3.666667
1  2.0  3.666667
2  2.0  3.666667

Upvotes: 2
