Reputation: 7647
I need to run some statistical analysis on intervals i.e. difference between two datetime fields in a table.
According to the aggregate function documentation, the general-purpose aggregate functions such as MAX, MIN, and AVG accept arguments of datetime and interval types.
However, for the more advanced statistical functions like stddev_pop, var_pop, var_samp, and stddev_samp, the supported input types seem to be numeric or similar only, although the documentation suggests there is no real difference between the two groups of functions:
... (These are separated out merely to avoid cluttering the listing of more-commonly-used aggregates.)...
Is there any easy way to calculate these parameters? And why isn't the interval type supported as an argument? Aren't these statistical aggregate functions unit-invariant?
P.S. I am not sure I can extract the epoch and use it, as some values may be negative.
Upvotes: 2
Views: 2079
Reputation: 95642
As I said in a comment, to work out the sample standard deviation manually, at some point you have to multiply an interval by an interval. PostgreSQL doesn't support that.
To work around that issue, reduce the interval to hours or minutes or seconds (or whatever). This turns out to be a lot simpler than working out the calculation manually, and it suggests why PostgreSQL doesn't support this kind of calculation out of the box.
First, a function from the PostgreSQL general mailing list:
CREATE OR REPLACE FUNCTION interval_to_seconds(interval)
RETURNS double precision AS $$
  -- Note: only the days/hours/minutes/seconds fields are used;
  -- any months or years component of the interval is ignored.
  SELECT (extract(days from $1) * 86400)
       + (extract(hours from $1) * 3600)
       + (extract(minutes from $1) * 60)
       + extract(seconds from $1);
$$ LANGUAGE SQL;
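As an aside, since the question mentions extracting the epoch: extract(epoch from ...) returns the signed total seconds of a day/time interval, negative values included, so a one-line variant along these lines should behave equivalently (with the caveat that it converts month/year components assuming 30-day months). The function name here is my own, not from the mailing list:

```sql
-- Hypothetical one-line alternative; handles negative intervals too.
CREATE OR REPLACE FUNCTION interval_to_seconds_epoch(interval)
RETURNS double precision AS $$
  SELECT extract(epoch from $1);
$$ LANGUAGE SQL;

-- select interval_to_seconds_epoch(interval '-90 minutes');  -- -5400
```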
Now we can take the standard deviation of a simple set of intervals.
with intervals (i) as (
  values (interval '1 hour'), (interval '2 hour'), (interval '3 hour'),
         (interval '4 hour'), (interval '5 hour')
),
intervals_as_seconds as (
  select interval_to_seconds(i) as seconds
  from intervals
)
select stddev(seconds) as in_sec, stddev(seconds)/60 as in_min
from intervals_as_seconds;

      in_sec      |      in_min
------------------+------------------
 5692.09978830308 | 94.8683298050514
You can verify the results however you like.
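For instance, here is a manual cross-check in plain SQL (a sketch; the CTE and alias names are mine). Note the squaring step: applied directly to intervals, this is exactly the interval-times-interval multiplication PostgreSQL doesn't support.

```sql
-- Sample standard deviation computed by hand: sqrt of the sum of
-- squared deviations from the mean, divided by (n - 1).
with intervals_as_seconds (seconds) as (
  values (3600.0), (7200.0), (10800.0), (14400.0), (18000.0)
)
select sqrt(sum((seconds - a.avg_s)^2) / (count(*) - 1)) as in_sec
from intervals_as_seconds
cross join (select avg(seconds) as avg_s from intervals_as_seconds) a;
-- ~5692.1, matching stddev(seconds) above
```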
Now let's say you wanted hour granularity instead of seconds. (Clearly, the choice of granularity is highly application-dependent.) You might define another function, interval_to_hours(interval), and use a very similar query to calculate the standard deviation.
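For illustration, one possible truncating definition (my own sketch, not from the original answer; it keeps whole hours only and, like interval_to_seconds, ignores month/year fields):

```sql
CREATE OR REPLACE FUNCTION interval_to_hours(interval)
RETURNS double precision AS $$
  -- Whole hours only: interval '45 minutes' yields 0.
  SELECT (extract(days from $1) * 24)
       + extract(hours from $1);
$$ LANGUAGE SQL;
```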
with intervals (i) as (
  values (interval '1 hour'), (interval '2 hour'), (interval '3 hour'),
         (interval '4 hour'), (interval '5 hour')
),
intervals_as_hours as (
  select interval_to_hours(i) as hours
  from intervals
)
select stddev(hours) as stddev_in_hrs
from intervals_as_hours;

  stddev_in_hrs
------------------
 1.58113883008419
The value for standard deviation in hours is clearly different from the value in minutes or in seconds. But they measure exactly the same thing. The point is that the "right" answer depends on the granularity (units) you want to use, and there are a lot of choices. (From microseconds to centuries, I imagine.)
Also, consider this statement.
select interval_to_hours(interval '45 minutes');

 interval_to_hours
-------------------
                 0
Is that the right answer? You can't say; the right answer is application-dependent. I can imagine applications that would want 45 minutes to be considered as 1 hour. I can also imagine applications that would want 45 minutes to be considered as 1 hour for some calculations, and as 0 hours for other calculations.
And think about this question: how many seconds are in a month? The expression interval '1' month is valid, but the number of seconds it represents depends on how many days are in the month.
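Incidentally, when PostgreSQL itself is forced to pick a convention, extract(epoch ...) treats a month as 30 days:

```sql
select extract(epoch from interval '1 month');
-- 2592000 = 30 days * 86400 seconds, one convention among several;
-- a real month may have 28 to 31 days.
```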
And I think that's why PostgreSQL doesn't support this kind of calculation out of the box. The right way to do it with interval arguments is too application-dependent.
Later . . .
I found this discussion on one of the PostgreSQL mailing lists.
Upvotes: 4