Reputation: 367
I would like to create a new dataframe from an existing one that keeps only the values exceeding the mean of their column. The problem is that each column has a different mean, and I do not want to calculate each mean separately and then filter each column by its own value. I tried a double loop, because the number of rows and columns varies, but with no success. For example, I have the following dataframe:
a b c
4 5 6
1 2 3
7 9 2
3 6 8
I calculate the mean of every column and then want to create a new dataframe containing only the values greater than the mean of the respective column, so:
a1 b1 c1
4 9 6
7 6 8
I am not even sure this is possible, because the columns in the new dataframe may end up with different lengths, but maybe the missing entries can be filled with NaN? I am not sure what the right solution should be.
Upvotes: 1
Views: 402
Reputation: 862581
You can compare the values with the column means and then replace the rest with NaNs, either by boolean indexing or with where:
df = df[df > df.mean()]
Or:
df = df.where(df > df.mean())
print (df)
a b c
0 4.0 NaN 6.0
1 NaN NaN NaN
2 7.0 9.0 NaN
3 NaN 6.0 8.0
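As a self-contained sketch of the two forms above, the following rebuilds the sample frame from the question and checks that boolean indexing and where give the same result. The variable names (means, mask, by_indexing, by_where) are only illustrative and do not come from the original answer.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({"a": [4, 1, 7, 3],
                   "b": [5, 2, 9, 6],
                   "c": [6, 3, 2, 8]})

# df.mean() returns one mean per column; comparing the frame with that
# Series aligns on the column labels, so each column is tested against
# its own mean
means = df.mean()
mask = df > means

# Both forms keep values where the mask is True and put NaN elsewhere
by_indexing = df[mask]
by_where = df.where(mask)

print(by_where)
print(by_indexing.equals(by_where))
```

No loop over columns is needed, because the comparison is broadcast across the rows for every column at once.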
If you also want to remove the NaNs so that the remaining values move up to the first rows, apply a custom function with dropna to each column:
df = df[df > df.mean()].apply(lambda x: pd.Series(x.dropna().values))
print (df)
a b c
0 4.0 9.0 6.0
1 7.0 6.0 8.0
In general, a column with fewer values above its mean is padded with NaNs at the end:
print (df)
a b c
0 4 5 6
1 1 2 3
2 7 9 2
3 3 6 8
4 3 6 8
print (df[df > df.mean()])
a b c
0 4.0 NaN 6.0
1 NaN NaN NaN
2 7.0 9.0 NaN
3 NaN 6.0 8.0
4 NaN 6.0 8.0
df = df[df > df.mean()].apply(lambda x: pd.Series(x.dropna().values))
print (df)
a b c
0 4.0 9.0 6.0
1 7.0 6.0 8.0
2 NaN 6.0 8.0
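The compaction step can also be wrapped in a small helper. This is only a sketch: the function name keep_above_mean is hypothetical, and it assumes every column is numeric. It reuses the same expression as above and rebuilds the five-row frame so that it can be run on its own.

```python
import pandas as pd

def keep_above_mean(frame):
    """Keep values above their column mean and shift them to the top.

    Hypothetical helper; assumes all columns are numeric.
    """
    masked = frame.where(frame > frame.mean())
    # Dropping NaNs per column and rebuilding a Series resets the index,
    # so shorter columns are padded with NaN at the end when the results
    # are combined back into one frame
    return masked.apply(lambda col: pd.Series(col.dropna().values))

# Five-row frame from the example above
df = pd.DataFrame({"a": [4, 1, 7, 3, 3],
                   "b": [5, 2, 9, 6, 6],
                   "c": [6, 3, 2, 8, 8]})

result = keep_above_mean(df)
print(result)
```

Note that the original row positions are lost after compaction, so values in the same row of the result no longer come from the same row of the input.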
Upvotes: 3