Reputation: 187
I have datasets that measure voltage values in a certain column. I'm looking for an elegant way to extract the rows that deviate from the mean value. There are a couple of groups in "volt_id", and I'd like each group to have its own mean/std and use those to decide which rows deviate within that group. For example, I have the original dataset below.
time volt_id value
0 14 A 300.00
1 15 A 310.00
2 15 B 200.00
3 16 B 210.00
4 17 B 300.00
5 14 C 100.00
6 16 C 110.00
7 20 C 200.00
After running the algorithm, I'd keep only rows 4 and 7, which deviate strongly from their groups, as below.
time volt_id value
4 17 B 300.00
7 20 C 200.00
I could do this if there were only a single group, but my code would get messy and lengthy doing it for multiple groups. I'd appreciate it if there's a simpler way to do this.
Thanks,
Upvotes: 2
Views: 680
Reputation: 210832
Similar to @COLDSPEED's solution:
In [179]: from scipy.stats import zscore
In [180]: df.loc[df.groupby('volt_id')['value'].transform(zscore) > 1]
Out[180]:
time volt_id value
4 17 B 300.0
7 20 C 200.0
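For reference, a self-contained sketch of this approach (the DataFrame is rebuilt from the question's data). Two details worth noting: `scipy.stats.zscore` defaults to the population standard deviation (`ddof=0`), and taking `.abs()` makes the filter two-sided, so rows far *below* their group mean are caught as well:

```python
import pandas as pd
from scipy.stats import zscore

# Rebuild the question's dataset
df = pd.DataFrame({
    'time': [14, 15, 15, 16, 17, 14, 16, 20],
    'volt_id': list('AABBBCCC'),
    'value': [300.0, 310.0, 200.0, 210.0, 300.0, 100.0, 110.0, 200.0],
})

# Per-group z-score; .abs() keeps rows deviating in either direction
z = df.groupby('volt_id')['value'].transform(zscore)
outliers = df[z.abs() > 1]
print(outliers)
```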
Upvotes: 1
Reputation: 11
One way to do this would be to detect outliers: http://www.mathwords.com/o/outlier.htm
You would need to compute the first and third quartiles and the interquartile range for each group. You could then filter your data with a simple comparison.
Quartiles are not the only way to determine outliers, however. Here's a discussion comparing standard deviation and quartiles for locating outliers: https://stats.stackexchange.com/questions/175999/determine-outliers-using-iqr-or-standard-deviation
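A sketch of what this could look like with groupby (the DataFrame is rebuilt from the question's data). One caveat: with only two or three points per group, the conventional 1.5 × IQR fence flags nothing here, so the multiplier below is an illustrative assumption tuned down to 0.5 rather than a recommended value:

```python
import pandas as pd

# Rebuild the question's dataset
df = pd.DataFrame({
    'time': [14, 15, 15, 16, 17, 14, 16, 20],
    'volt_id': list('AABBBCCC'),
    'value': [300.0, 310.0, 200.0, 210.0, 300.0, 100.0, 110.0, 200.0],
})

def iqr_outliers(g, factor=0.5):
    """Mark values outside [Q1 - factor*IQR, Q3 + factor*IQR] within one group.

    The usual factor is 1.5, but these groups are too small for that fence
    to flag anything, so a tighter factor is used for illustration.
    """
    q1, q3 = g.quantile(0.25), g.quantile(0.75)
    iqr = q3 - q1
    return (g < q1 - factor * iqr) | (g > q3 + factor * iqr)

mask = df.groupby('volt_id')['value'].transform(iqr_outliers)
print(df[mask])
```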
Upvotes: 0
Reputation: 402263
You can compute the z-score within each group using groupby and filter on it.
Assuming you want only those rows which are 1 or more standard deviations away from the mean,
g = df.groupby('volt_id').value
v = (df.value - g.transform('mean')) / g.transform('std')
df[v.abs().ge(1)]
time volt_id value
4 17 B 300.0
7 20 C 200.0
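One detail when comparing this with the scipy-based answer: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `scipy.stats.zscore` defaults to the population standard deviation (`ddof=0`), so the two can disagree near the threshold. A small sketch (DataFrame rebuilt from the question) showing that passing `ddof=1` through `transform` makes them match:

```python
import pandas as pd
from scipy.stats import zscore

# Rebuild the question's dataset
df = pd.DataFrame({
    'time': [14, 15, 15, 16, 17, 14, 16, 20],
    'volt_id': list('AABBBCCC'),
    'value': [300.0, 310.0, 200.0, 210.0, 300.0, 100.0, 110.0, 200.0],
})

g = df.groupby('volt_id').value
manual = (df.value - g.transform('mean')) / g.transform('std')  # ddof=1
scipy_z = g.transform(zscore, ddof=1)  # extra kwargs are forwarded to zscore
result = df[manual.abs().ge(1)]
print(result)
```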
Upvotes: 2