Reputation: 4626
Imagine that I have a DataFrame
with columns that contain only real values.
>> df
col1 col2 col3
0 0.907609 82 4.207991
1 3.743659 1523 6.488842
2 2.358696 324 5.092592
3 0.006793 0 0.000000
4 19.319746 11969 7.405685
I want to group it by quartiles (or any other percentiles specified by me) of the chosen column (e.g., col1
), to perform some operations on these groups. Ideally, I would like to do something like:
df.groupy( quartiles_of_col1 ).mean() # not working, how to code quartiles_of_col1?
The output should give the mean of each of the columns for four groups corresponding to the quartiles of col1
. Is this possible with the groupby
command? What's the simplest way of achieving it?
Upvotes: 6
Views: 10274
Reputation: 185
Pandas has a native solution, pandas.qcut
, to this as well:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
Upvotes: 0
Reputation: 54330
I don't have a computer to test it right now, but I think you can do it by: df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean()
. Will update after 150mins.
Some explanations:
In [42]:
#use np.percentile to get the bin edges of any percentile you want
np.percentile(df.col0, [0, 25, 75, 90, 100])
Out[42]:
[0.0067930000000000004,
0.907609,
3.7436589999999996,
13.089311200000001,
19.319745999999999]
In [43]:
#Need to use include_lowest=True
print df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]), include_lowest=True)).mean()
col0 col1 col2
col0
[0.00679, 0.908] 0.457201 41.0 2.103996
(0.908, 3.744] 3.051177 923.5 5.790717
(3.744, 13.0893] NaN NaN NaN
(13.0893, 19.32] 19.319746 11969.0 7.405685
In [44]:
#Or the smallest values will be skiped
print df.groupby(pd.cut(df.col0, np.percentile(df.col0, [0, 25, 75, 90, 100]))).mean()
col0 col1 col2
col0
(0.00679, 0.908] 0.907609 82.0 4.207991
(0.908, 3.744] 3.051177 923.5 5.790717
(3.744, 13.0893] NaN NaN NaN
(13.0893, 19.32] 19.319746 11969.0 7.405685
Upvotes: 12
Reputation: 4120
I hope this will solve your problem. It is not pretty but I hope it will work for you
import pandas as pd
import random
import numpy as np
## create a mock df as example. with column A, B, C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
## select dataframe based on the quantile of column A, using the quantile method.
df[df['A'] < df['A'].quantile(0.3)].mean()
this will print
A -1.157615
B 0.205529
C -0.108263
D 0.346752
dtype: float64
Upvotes: 0