Reputation: 1194
say I have a postgresql table with the following values:
id | value
----------
1 | 4
2 | 8
3 | 100
4 | 5
5 | 7
If I use postgresql to calculate the average, it gives me an average of 24.8 because the high value of 100 has great impact on the calculation. While in fact I would like to find an average somewhere around 6 and eliminate the extreme(s).
I am looking for a way to eliminate extremes and want to do this "statistically correct". The extreme's cannot be fixed. I cannot say; If a value is over X, it has to be eliminated.
I have been bending my head on the postgresql aggregate functions but cannot put my finger on what is right for me to use. Any suggestions?
Upvotes: 5
Views: 10384
Reputation: 1541
You can use IQR to filter outliers. PL/pgSQL code:
select percentile_cont(0.25) WITHIN GROUP (ORDER BY value)
into q1
from table;
select percentile_cont(0.75) WITHIN GROUP (ORDER BY value)
into q3
from table;
iqr := q3 - q1;
min := q1 - 1.5 * iqr;
max := q3 + 1.5 * iqr;
select value
into result
from table
where value >= min and value <= max;
return result;
Upvotes: 2
Reputation: 8105
Mind using the ntile window function. It allows you to easily isolate extreme values from the result set.
Let's say you want to cut 10% from both sides of the result set. Then passing the value of 10 to ntile
and looking for values between 2 and 9 would give you the desired result. Keep also in mind that if you have less than 10 records, you might accidentally cut more than 20%, so be sure to check the total amount of records as well.
WITH yyy AS (
SELECT
id,
value,
NTILE(10) OVER (ORDER BY value) AS ntiled,
COUNT(*) OVER () AS counted
FROM
xxx)
SELECT
*
FROM
yyy
WHERE
counted < 10 OR ntiled BETWEEN 2 AND 9;
Upvotes: 2
Reputation: 154484
Here's an aggregate function which will calculate the trimmed mean for a set of values, excluding values outside N standard deviations from the mean.
Example:
DROP TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo (x FLOAT);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);
INSERT INTO foo VALUES (3);
INSERT INTO foo VALUES (4);
INSERT INTO foo VALUES (100);
SELECT avg(x), tmean(x, 2), tmean(x, 1.5) FROM foo;
-- avg | tmean | tmean
-- -----+-------+-------
-- 22 | 22 | 2.5
Code:
DROP TYPE IF EXISTS tmean_stype CASCADE; CREATE TYPE tmean_stype AS ( deviations FLOAT, count INT, acc FLOAT, acc2 FLOAT, vals FLOAT[] ); CREATE OR REPLACE FUNCTION tmean_sfunc(tmean_stype, float, float) RETURNS tmean_stype AS $$ SELECT $3, $1.count + 1, $1.acc + $2, $1.acc2 + ($2 * $2), array_append($1.vals, $2); $$ LANGUAGE SQL; CREATE OR REPLACE FUNCTION tmean_finalfunc(tmean_stype) RETURNS float AS $$ DECLARE fcount INT; facc FLOAT; mean FLOAT; stddev FLOAT; lbound FLOAT; ubound FLOAT; val FLOAT; BEGIN mean := $1.acc / $1.count; stddev := sqrt(($1.acc2 / $1.count) - (mean * mean)); lbound := mean - stddev * $1.deviations; ubound := mean + stddev * $1.deviations; -- RAISE NOTICE 'mean: % stddev: % lbound: % ubound: %', mean, stddev, lbound, ubound; fcount := 0; facc := 0; FOR i IN array_lower($1.vals, 1) .. array_upper($1.vals, 1) LOOP val := $1.vals[i]; IF val >= lbound AND val <= ubound THEN fcount := fcount + 1; facc := facc + val; END IF; END LOOP; IF fcount = 0 THEN return NULL; END IF; RETURN facc / fcount; END; $$ LANGUAGE plpgsql; CREATE AGGREGATE tmean(float, float) ( SFUNC = tmean_sfunc, STYPE = tmean_stype, FINALFUNC = tmean_finalfunc, INITCOND = '(-1, 0, 0, 0, {})' );
Gist (which should be identical): https://gist.github.com/4458294
Upvotes: 4
Reputation: 342
I cannot say; If a value is over X, it has to be eliminated.
Well, you could use having and a subselect to eliminate outliers, something like:
HAVING value < (
SELECT 2 * avg(value)
FROM mytable
GROUP BY ...
)
(Or, for that matter, use a more complex version to eliminate anything above 2 or 3 standard deviations if you want something that will be better at eliminating only outliers.)
The other option is to look at generating a median value, which is a fairly statistically sound way of accounting for outliers; happily there are three reasonable examples of just that: one from the Postgresql Wiki, one built as an Oracle compatability layer, and another from the PostgreSQL Journal. Note the caveats around how precisely/accurately they implement medians.
Upvotes: 8
Reputation: 35331
Postgresql can also calculate the standard deviation.
You could take only the data points which are in the average() +/- 2*stddev() which would roughly correspond to the 90% datapoints closest to the average.
Of course 2 can also be 3 (95%) or 6 (99.995%) but do not get hung up on the numbers because in the presence of a collection outliers you are no longer dealing with a normal distribution.
Be very careful and validate that it works as expected.
Upvotes: 14