Lucas Billett
Lucas Billett

Reputation: 23

MySQL Select Results Excluding Outliers Using AVG and STD Conditions

I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, I'm guessing due to calling the aggregate functions. Conceptually, this is what I'm after:

SELECT t.Result FROM
  (SELECT Result, AVG(Result) avgr, STD(Result) stdr
   FROM myTable WHERE myField=myCondition limit=75) as t
WHERE t.Result BETWEEN (t.avgr-6*t.stdr) AND (t.avgr+6*t.stdr)

I can get it to work by replacing each use of the STD or AVG value (ie. t.avgr) with it's own select statement as:

(SELECT AVG(Result) FROM myTable WHERE myField=myCondition limit=75) 

However this seems waay more messy than I expect it needs to be (I've a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?

Upvotes: 2

Views: 4182

Answers (2)

Brian Leishman
Brian Leishman

Reputation: 8555

I created a UDF that doesn't calculate exactly the way you asked (it discards a percent of the results from the top and bottom, instead of using std), but it might be useful for you (or someone else) anyway, matching the Excel function referenced here https://support.office.com/en-us/article/trimmean-function-d90c9878-a119-4746-88fa-63d988f511d3

https://github.com/StirlingMarketingGroup/mysql-trimmean

Usage

`trimmean` ( `NumberColumn`, double `Percent` [, integer `Decimals` = 4 ] )
  • `NumberColumn`

    • The column of values to trim and average.
  • `Percent`

    • The fractional number of data points to exclude from the calculation. For example, if percent = 0.2, 4 points are trimmed from a data set of 20 points (20 x 0.2): 2 from the top and 2 from the bottom of the set.
  • `Decimals`

    • Optionally, the number of decimal places to output. Default is 4.

Upvotes: 1

John Bollinger
John Bollinger

Reputation: 180389

Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).

You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:

SELECT t.Result FROM
  myTable AS t
  CROSS JOIN (
    SELECT AVG(Result) avgr, STD(Result) stdr
    FROM myTable
    WHERE myField = myCondition
  ) AS stats
WHERE 
  t.myField = myCondition
  AND t.Result BETWEEN (stats.avgr-6*stats.stdr) AND (stats.avgr+6*stats.stdr)
LIMIT 75

Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from, hence the duplication of the myField = myCondition predicate, but also the removal of the LIMIT clause to the outer query only.

You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.

Upvotes: 2

Related Questions