Reputation: 5210
I have a set of records from a query with columns (idx, time, category, weight, distance):

- idx is an INTEGER value describing some sort of relation
- time is TIMESTAMP WITHOUT TIME ZONE and can take (almost) arbitrary values, but each value will be present many times (for every idx and category)
- category is VARCHAR and a categorical variable; its values are limited and will recur quite often
- weight is DOUBLE PRECISION
- distance is some precalculated value

The rows can look like the following:
(1, '2017-01-01 00:00', 'class_a', 1, 234.5)
(1, '2017-01-01 00:00', 'class_a', 1, 987.1)
(1, '2017-01-01 00:00', 'class_a', 1, 1.23)
(1, '2017-01-01 00:00', 'class_b', 1, 48.5)
(2, '2017-01-01 00:00', 'class_a', 1, 8763.5)
(1, '2017-01-01 00:13', 'class_a', 1, 598.02)
(1, '2017-01-01 00:13', 'class_b', 1, 76.9)
...
(2, '2017-01-27 21:07', 'class_b', 1, 184.0)
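For reproduction, the sample rows can be loaded into a scratch table like this (a minimal setup sketch; the table name my_table matches the query further down, and the column types follow the descriptions above):

```sql
-- hypothetical setup for the sample rows shown above
CREATE TABLE my_table (
    idx      INTEGER,
    "time"   TIMESTAMP WITHOUT TIME ZONE,
    category VARCHAR,
    weight   DOUBLE PRECISION,
    distance DOUBLE PRECISION
);

INSERT INTO my_table VALUES
    (1, '2017-01-01 00:00', 'class_a', 1, 234.5),
    (1, '2017-01-01 00:00', 'class_a', 1, 987.1),
    (1, '2017-01-01 00:00', 'class_a', 1, 1.23),
    (1, '2017-01-01 00:00', 'class_b', 1, 48.5),
    (2, '2017-01-01 00:00', 'class_a', 1, 8763.5),
    (1, '2017-01-01 00:13', 'class_a', 1, 598.02),
    (1, '2017-01-01 00:13', 'class_b', 1, 76.9),
    (2, '2017-01-27 21:07', 'class_b', 1, 184.0);
```

The column "time" is quoted because time is also a type name in PostgreSQL.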
I'm seeking a solution for calculating a custom aggregate on such data, but I can hardly find any instructions or examples on how this is actually done (hopefully without having to write a C extension for Postgres).
SELECT
idx, time, category,
weighted_density(
        weight, distance, 10000.0 -- arbitrary 10k is explained below
) AS wd
FROM (my rows as shown above)
GROUP BY
idx, time, category
I feel like setting up a custom aggregate (named WEIGHTED_DENSITY here) should be the right way to achieve something like the outlined query. My aim is to end up with a result set where the compound (idx, time, category) is unique, and its wd is calculated using all weight and distance values from the related rows.
First, I fetched the entire set of rows from the database and calculated the aggregate offline in another program and language (Python). But this is quite resource-consuming, and the calculation should run on the database server instead of the local machine (also to ensure integrity).
Then, I tried setting up a postgres function to calculate the result value using a single row:
CREATE OR REPLACE FUNCTION _gaussian_density(
IN DOUBLE PRECISION, -- the weight
IN DOUBLE PRECISION, -- the distance
IN DOUBLE PRECISION -- the maximum distance
) RETURNS DOUBLE PRECISION AS
$BODY$
BEGIN
-- calculate weighted density, using max distance;
-- this calculation itself doesn't really matter; it's some sort
-- of density using a cropped gaussian kernel, for those who ask.
RETURN
CASE
WHEN ABS($2) > ABS($3) THEN 0.0
WHEN ABS($2) <= 0.0 THEN 1.0
ELSE
$1 * (
1.0 / |/ (2.0 * PI())
) * POWER(EXP(-1 * (3.0 * ABS($2) / ABS($3))), 2)
/ 0.4
END;
END
$BODY$
LANGUAGE plpgsql VOLATILE
COST 10;
Also, to make the function available as an aggregate, I tried
CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
sfunc = _gaussian_density,
stype = DOUBLE PRECISION,
initcond = 0.0
);
But that's where I'm stuck: I just can't get it right, and it seems I need an example or a little hint that pushes me in the right direction on how custom aggregates are properly created and used.
Cheers to you guys and thanks a lot in advance!
Thanks @klin for pointing out that I forgot to carry the aggregate state along. Now this finally works:
CREATE FUNCTION _gaussian_density(
weight FLOAT8,
distance FLOAT8,
maxdist FLOAT8
)
RETURNS FLOAT8
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql
AS $$
DECLARE
abs_weight FLOAT8;
abs_distance FLOAT8;
abs_maxdist FLOAT8;
dist_weight FLOAT8;
BEGIN
-- calculate weighted density, using max distance;
-- this calculation itself doesn't really matter; it's some sort
-- of density using a cropped gaussian kernel, for the curious
abs_weight := ABS(COALESCE(weight, 1.0));
abs_distance := ABS(COALESCE(distance, 0.0));
abs_maxdist := ABS(COALESCE(maxdist, 0.0));
IF abs_distance > abs_maxdist THEN RETURN 0.0; END IF;
IF abs_distance <= 0.0 THEN RETURN 1.0 * abs_weight; END IF;
RETURN abs_weight * (
1.0 / |/ (2.0 * PI())
) * POWER(EXP(-1 * (3.0 * abs_distance / abs_maxdist)), 2)
/ 0.4;
END;
$$;
CREATE FUNCTION _gaussian_statetransition(
agg_state FLOAT8, -- carry the state!
weight FLOAT8,
distance FLOAT8,
maxdist FLOAT8)
RETURNS FLOAT8
IMMUTABLE
LANGUAGE plpgsql
AS $$
BEGIN
RETURN
agg_state + _gaussian_density(weight, distance, maxdist);
END;
$$;
CREATE AGGREGATE weighted_density(FLOAT8, FLOAT8, FLOAT8)
(
sfunc = _gaussian_statetransition,
stype = FLOAT8,
initcond = 0
);
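With these definitions in place, the aggregate can be used as outlined in the question (10000.0 again being the arbitrary maximum distance; "time" is quoted since it collides with a type name):

```sql
SELECT
    idx, "time", category,
    weighted_density(weight, distance, 10000.0) AS wd
FROM my_table
GROUP BY idx, "time", category
ORDER BY idx, "time", category;
```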
I wanted to still be able to use the density calculation outside of an aggregate, so I added a separate state-transition function that in turn calls _gaussian_density. The aggregate then defines the state type and its initial state, and we're good to go. To handle some edge cases correctly, I also adapted _gaussian_density a bit (mainly towards dealing with NULL values).
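Since _gaussian_density remains an ordinary function, it can still be called on its own, for example for a single weight/distance pair (the expected values follow directly from the two early-return branches in the function body):

```sql
-- distance 0 yields the full weight; distance beyond maxdist yields 0
SELECT
    _gaussian_density(1.0, 0.0,     10000.0) AS at_center,    -- 1.0
    _gaussian_density(1.0, 20000.0, 10000.0) AS out_of_range; -- 0.0
```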
Thanks a lot!
Upvotes: 0
Views: 1494
Reputation: 121754
The function _gaussian_density() should depend on the value calculated in the previous step. If in your case this is the first argument weight, then the initial condition should not be 0, as all subsequent calculations would then yield zero. I assume that the initial value of weight is 1.0:
DROP AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION);
CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
sfunc = _gaussian_density,
stype = DOUBLE PRECISION,
initcond = 1.0 -- !!
);
Note that the aggregate does not use the column weight of the table: that argument position carries the internal-state value, for which only the initial condition is declared and which is returned as the final result.
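To make the state handling concrete: for a group with distances d1, d2 and d3 (placeholders in this sketch), Postgres effectively evaluates the aggregate as nested calls of the sfunc, seeded with initcond, so the running state takes the place of the weight argument:

```sql
-- conceptual expansion of weighted_density(distance, 10000) over a
-- three-row group; initcond = 1.0 seeds the first (state) argument
SELECT _gaussian_density(
           _gaussian_density(
               _gaussian_density(1.0, d1, 10000),
               d2, 10000),
           d3, 10000);
```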
SELECT
idx, time, category,
weighted_density(distance, 10000) AS wd -- !!
FROM my_table
GROUP BY idx, time, category
ORDER BY idx, time, category;
idx | time | category | wd
-----+---------------------+----------+---------------------
1 | 2017-01-01 00:00:00 | class_a | 0.476331421206002
1 | 2017-01-01 00:00:00 | class_b | 0.968750868953701
1 | 2017-01-01 00:13:00 | class_a | 0.69665860026144
1 | 2017-01-01 00:13:00 | class_b | 0.952383202706387
2 | 2017-01-01 00:00:00 | class_a | 0.00519142111518706
2 | 2017-01-27 21:07:00 | class_b | 0.893107967346503
(6 rows)
I am not sure I have correctly read your intentions, but my remarks should put you on the right path.
Upvotes: 2