Reputation: 191
I have a dataframe:
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
   A  B
0  1  2
1  1  3
2  4  6
I want to return a dataframe of the same size containing the mean of each column:
   A      B
0  2  3.666
1  2  3.666
2  2  3.666
Is there a simple way of doing this?
Upvotes: 3
Views: 2010
Reputation: 88236
Here's one with assign:
df.assign(**df.mean())
     A         B
0  2.0  3.666667
1  2.0  3.666667
2  2.0  3.666667
Details
The mean is easily obtained with DataFrame.mean:
df.mean()
A    2.000000
B    3.666667
dtype: float64
From the above Series, we can use dictionary unpacking to replace the existing columns with the resulting values. Note that we can unpack the Series into a dictionary using **:
{**df.mean()}
# {'A': 2.0, 'B': 3.6666666666666665}
Given that assign adds new columns as df.assign(a_given_column=a_value, another_column=some_other_value), the unpacking turns the dictionary keys into the function's keyword arguments. And since the original dataframe's index is respected, df.assign(**df.mean()) will replace the dataframe's values with the means.
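To make the unpacking step concrete, here is a minimal sketch on the question's dataframe. It compares the unpacked call with the same call written out by hand; the variable names (means, spelled_out, unpacked) are only illustrative. Keyword unpacking needs string keys, so this assumes the column names are strings.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# Series of column means, indexed by column name.
means = df.mean()

# Writing the keyword arguments out by hand ...
spelled_out = df.assign(A=means['A'], B=means['B'])

# ... is what ** does for us: each Series label becomes a keyword,
# each mean becomes the scalar that assign broadcasts down the column.
unpacked = df.assign(**means)
```

Both results have the shape and index of df, with every row holding the column means.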
Upvotes: 2
Reputation: 59529
Recreate the DataFrame. Send the mean Series to a dict, then the index defines the number of rows.
pd.DataFrame(df.mean().to_dict(), index=df.index)
#     A         B
# 0  2.0  3.666667
# 1  2.0  3.666667
# 2  2.0  3.666667
Same concept, but creating the full array first saves a decent amount of time.
pd.DataFrame(np.broadcast_to(df.mean(), df.shape),
             index=df.index,
             columns=df.columns)
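For readers unfamiliar with np.broadcast_to, the sketch below splits the broadcast version into steps on the question's dataframe. The intermediate names (means, tiled, result) are mine, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# 1-D array of column means, shape (2,).
means = df.mean().to_numpy()

# Stretch it to the frame's shape (3, 2) without copying the data;
# the result is a read-only view in which every row is `means`.
tiled = np.broadcast_to(means, df.shape)

# Wrap the full array, reusing the original labels.
result = pd.DataFrame(tiled, index=df.index, columns=df.columns)
```

Because broadcast_to returns a read-only view, passing tiled.copy() instead may be the safer choice if you intend to modify the result in place.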
Here are some timings. They will depend slightly on the number of columns, but you can see there are pretty large differences when you provide the entire array to begin with.
import perfplot
import pandas as pd
import numpy as np

perfplot.show(
    setup=lambda N: pd.DataFrame(np.random.randint(1, 100, (N, 5)),
                                 columns=[str(x) for x in range(5)]),
    kernels=[
        lambda df: pd.DataFrame(np.broadcast_to(df.mean(), df.shape), index=df.index, columns=df.columns),
        lambda df: df.assign(**df.mean()),
        lambda df: pd.DataFrame(df.mean().to_dict(), index=df.index)
    ],
    labels=['numpy broadcast', 'assign', 'dict'],
    n_range=[2 ** k for k in range(1, 22)],
    equality_check=np.allclose,
    xlabel="Len(df)"
)
Upvotes: 1
Reputation: 148890
You can pass just a single row at DataFrame creation time and let the index set the number of rows:
pd.DataFrame(data = [df.mean()], index = df.index)
It gives:
     A         B
0  2.0  3.666667
1  2.0  3.666667
2  2.0  3.666667
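If you would rather not depend on the constructor stretching a single row over a longer index (I am assuming, not asserting, that this may vary between pandas versions), a sketch of the same idea with the repetition written out explicitly could look like this. The names row and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# One-row frame holding the column means (columns A and B).
row = df.mean().to_frame().T

# Select that single row once per row of df, then restore df's index.
result = row.iloc[[0] * len(df)]
result.index = df.index
```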
Upvotes: 2