Tinchosawa
Tinchosawa

Reputation: 51

Subset original dataframe based on grouped quantiles

This is my df:

   NAME  DEPTH    A1        A2      A3      AA4     AA5     AI4     AC5     Surface 
0  Ron   2800.04  8440.53   1330.99 466.77  70.19   56.79   175.96  77.83   C
1  Ron   2801.04  6084.15   997.13  383.31  64.68   51.09   154.59  73.88   C 
2  Ron   2802.04  4496.09   819.93  224.12  62.18   47.61   108.25  63.86   C 
3  Ben   2803.04  5766.04   927.69  228.41  65.51   49.94   106.02  62.61   L 
4  Ron   2804.04  6782.89   863.88  223.79  63.68   47.69   101.95  61.83   L 
... ... ... ... ... ... ... ... ... ... ...

So, my first problem has been answered here: Find percentile in pandas dataframe based on groups

Using:

df.groupby('Surface')['DEPTH'].quantile([.1, .9])

I can get the percentiles [.1,.9] from DEPTH grouped by Surface, which is what I need:

Surface        
C           0.1    2800.24
            0.9    2801.84
L           0.1    3799.74
            0.9    3960.36
N           0.1    2818.24
            0.9    2972.86
P           0.1    3834.94
            0.9    4001.16
Q           0.1    3970.64
            0.9    3978.62
R           0.1    3946.14
            0.9    4115.96
S           0.1    3902.03
            0.9    4073.26
T           0.1    3858.14
            0.9    4029.96
U           0.1    3583.01
            0.9    3843.76
V           0.1    3286.01
            0.9    3551.06
Y           0.1    2917.00
            0.9    3135.86
X           0.1    3100.01
            0.9    3345.76
Z           0.1    4128.56
            0.9    4132.56
Name: DEPTH, dtype: float64

Now, I believe that was already the hardest part. What is left is subsetting the original df to include only the values in between those DEPTH percentiles .1 & .9. So for example: DEPTH values in Surface group "Z" have to be greater than 4128.56 and less than 4132.56. Note that I need df again, not df.groupby("Surface"): the final df would be exactly the same, but the rows whose depths are outside the borders should be dropped.

This seems so easy ... any ideas? Thanks!

Upvotes: 2

Views: 695

Answers (1)

ALollz
ALollz

Reputation: 59519

When you need to filter rows within groups it's often simpler and faster to use groupby + transform to broadcast the result to every row within a group and then filter the original DataFrame. In this case we can check if 'DEPTH' is between those two quantiles.

Sample Data

import pandas as pd
import numpy as np
np.random.seed(42)

df = pd.DataFrame({'DEPTH': np.random.normal(0,1,100), 
                   'Surface': np.random.choice(list('abcde'), 100)})

Code

gp = df.groupby('Surface')['DEPTH']

df1 = df[df['DEPTH'].between(gp.transform('quantile', 0.1),
                             gp.transform('quantile', 0.9))]

For clarity, here you can see that transform will broadcast the scalar result to every row that belongs to the group, in this case defined by 'Surface'

pd.concat([df['Surface'], gp.transform('quantile', 0.1).rename('q = 0.1')], axis=1)

#   Surface   q = 0.1
#0        a -1.164557
#1        e -0.967809
#2        a -1.164557
#3        c -1.426986
#4        b -1.544816
#..     ...       ...
#95       a -1.164557
#96       e -0.967809
#97       b -1.544816
#98       b -1.544816
#99       b -1.544816
# 
#[100 rows x 2 columns]

Upvotes: 1

Related Questions