MySQL Select Results Excluding Outliers Using AVG and STD Conditions

Question

I'm trying to write a query that excludes values beyond 6 standard deviations from the mean of the result set. I expect this can be done elegantly with a subquery, but I'm getting nowhere and in every similar case I've read the aim seems to be just a little different. My result set seems to get limited to a single row, I'm guessing due to calling the aggregate functions. Conceptually, this is what I'm after:

SELECT t.Result FROM
  (SELECT Result, AVG(Result) avgr, STD(Result) stdr
   FROM myTable WHERE myField=myCondition limit=75) as t
WHERE t.Result BETWEEN (t.avgr-6*t.stdr) AND (t.avgr+6*t.stdr)

I can get it to work by replacing each use of the STD or AVG value (ie. t.avgr) with it's own select statement as:

(SELECT AVG(Result) FROM myTable WHERE myField=myCondition limit=75)

However this seems waay more messy than I expect it needs to be (I've a few conditions). At first I thought specifying a HAVING clause was necessary, but as I learn more it doesn't seem to be quite what I'm after. Am I close? Is there some snazzy way to access the value of aggregate functions for use in conditions (without needing to return the aggregate values)?

John Bollinger · Accepted Answer

Yes, your subquery is an aggregate query with no GROUP BY clause, therefore its result is a single row. When you select from that, you cannot get more than one row. Moreover, it is a MySQL extension that you can include the Result field in the subquery's selection list at all, as it is neither a grouping column nor an aggregate function of the groups (so what does it even mean in that context unless, possibly, all the relevant column values are the same?).

You should be able to do something like this to compute the average and standard deviation once, together, instead of per-result:

SELECT t.Result FROM
  myTable AS t
  CROSS JOIN (
    SELECT AVG(Result) avgr, STD(Result) stdr
    FROM myTable
    WHERE myField = myCondition
  ) AS stats
WHERE 
  t.myField = myCondition
  AND t.Result BETWEEN (stats.avgr-6*stats.stdr) AND (stats.avgr+6*stats.stdr)
LIMIT 75

Note that you will want to be careful that the statistics are computed over the same set of rows that you are selecting from, hence the duplication of the myField = myCondition predicate, but also the removal of the LIMIT clause to the outer query only.

You can add more statistics to the aggregate subquery, provided that they are all computed over the same set of rows, or you can join additional statistics computed over different rows via a separate subquery. Do ensure that all your statistics subqueries return exactly one row each, else you will get duplicate (or no) results.

MySQL Select Results Excluding Outliers Using AVG and STD Conditions

Answers (2)

Usage

Related Questions