Joohun Lee
Joohun Lee

Reputation: 187

Detecting the outlier from rows by certain column in panda dataframe

I have datasets which measure voltage values in certain column. I'm looking for elegant way to extract the rows that is deviated from mean value. There are couple of group in "volt_id" and I'd like to have each group create their own mean/std and use them to decide which rows are deviated from each group. for example, I have original dataset as below.

      time     volt_id     value
 0    14         A         300.00
 1    15         A         310.00
 2    15         B         200.00
 3    16         B         210.00
 4    17         B         300.00
 5    14         C         100.00
 6    16         C         110.00
 7    20         C         200.00

After the algorithm running, I'd only keep row 4 and 7 which is highly deviated from their groups as below.

      time     volt_id     value
 4    17         B         300.00
 7    20         C         200.00

I could do this if there is only single group but my codes would be messy and lengthy if do this for multiple groups. I'd appreciate if there's simpler way to do this.

thanks,

Upvotes: 2

Views: 680

Answers (3)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

Similar to @COLDSPEED's solution:

In [179]: from scipy.stats import zscore

In [180]: df.loc[df.groupby('volt_id')['value'].transform(zscore) > 1]
Out[180]:
   time volt_id  value
4    17       B  300.0
7    20       C  200.0

Upvotes: 1

Amotttier
Amotttier

Reputation: 11

One way to do this would be using outliers: http://www.mathwords.com/o/outlier.htm

You would need to define your inner quartile range and first and third quartiles. You could then filter your data onsimple comparison.

Quartiles are not the only way to determine outliers howevet. Heres a discussion comparing standard deviation and quartiles for locating outliers: https://stats.stackexchange.com/questions/175999/determine-outliers-using-iqr-or-standard-deviation

Upvotes: 0

cs95
cs95

Reputation: 402263

You can compute and filter on the zscore on each group using groupby.

Assuming you want only those rows which are 1 or more standard deviations away from mean,

g = df.groupby('volt_id').value
v = (df.value - g.transform('mean')) / g.transform('std')

df[v.abs().ge(1)]

   time volt_id  value
4    17       B  300.0
7    20       C  200.0

Upvotes: 2

Related Questions