Custom Aggregate in postgres using multiple columns

Question

Disclaimer: Solution below

I have a set of records from a query with columns (idx, time, category, weight, distance):

idx is an INTEGER value describing some sort of relation
time is TIMESTAMP WITHOUT TIMEZONE that can take (almost) arbitrary values, but each value will be present many times (for every idx and category)
category is VARCHAR and a categorical variable; its values are limited and will recur quite often
weight is DOUBLE PRECISION
distance is some precalculated value

The rows can look like the following:

(1, '2017-01-01 00:00', 'class_a', 1, 234.5)
(1, '2017-01-01 00:00', 'class_a', 1, 987.1)
(1, '2017-01-01 00:00', 'class_a', 1, 1.23)
(1, '2017-01-01 00:00', 'class_b', 1, 48.5)
(2, '2017-01-01 00:00', 'class_a', 1, 8763.5)
(1, '2017-01-01 00:13', 'class_a', 1, 598.02)
(1, '2017-01-01 00:13', 'class_b', 1, 76.9)
...
(2, '2017-01-27 21:07', 'class_b', 1, 184.0)

What's the question?

I'm seeking for a solution to calculate a custom aggregate on such data, but I can hardly find any instructions or examples on how this is actually done (hopefully possible without writing a C extension to postgres).

SELECT
  idx, time, category,
  weighted_density(
    value, distance, 10000.0 -- arbitrary 10k is explained below
  ) AS wd
FROM (my rows as shown above)
GROUP BY
  idx, time, category

I feel like setting up a custom aggregate (named WEIGHTED_DENSITY here) should be the right way to achieve something like the outlined query. My aim is to end up with a result set where the compound (idx, time, category) is unique, and its wd is calculated using all weight and distance values from the related rows.

Disclaimer: Solution below

What have I tried so far?

First, I was getting the entire rows from the database and calculated the aggregate off-line using another program and language (python). But this is quite resource consuming and should be run on the database server instead of the local machine (also to ensure integrity).

Then, I tried setting up a postgres function to calculate the result value using a single row:

CREATE OR REPLACE FUNCTION _gaussian_density(
    IN DOUBLE PRECISION, -- the weight
    IN DOUBLE PRECISION, -- the distance
    IN DOUBLE PRECISION  -- the maximum distance
  ) RETURNS DOUBLE PRECISION AS
$BODY$
BEGIN
  -- calculate weighted density, using max distance;
  -- this calculation itself doesn't really matter; it's some sort
  -- of density using a cropped gaussian kernel, for those who ask.
  RETURN
    CASE
      WHEN ABS($2) > ABS($3) THEN 0.0
      WHEN ABS($2) <= 0.0 THEN 1.0
      ELSE
        $1 * (
          1.0 / |/ (2.0 * PI())
        ) * POWER(EXP(-1 * (3.0 * ABS($2) / ABS($3))), 2)
        / 0.4
    END;
END
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 10;

Also, to make the function available as an aggregate, I tried

CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 0.0
);

But that's where I'm stuck, I just can't get it right and it seems I need an example or a little bit of a hint that pushes me into the right direction on how custom aggregates are properly created and made use of.

Cheers to you guys and thanks a lot in advance!

SOLUTION

Thanks @klin for pointing out that I missed to carry the aggregate state along. Now, this finally works:

CREATE FUNCTION _gaussian_density(
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8
  )
RETURNS FLOAT8
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql
AS $$
  DECLARE
    abs_weight FLOAT8;
    abs_distance FLOAT8;
    abs_maxdist FLOAT8;
    dist_weight FLOAT8;
  BEGIN
    -- calculate weighted density, using max distance;
    -- this calculation itself doesn't really matter; it's some sort
    -- of density using a cropped gaussian kernel, for the curious
    abs_weight := ABS(COALESCE(weight, 1.0));
    abs_distance := ABS(COALESCE(distance, 0.0));
    abs_maxdist := ABS(COALESCE(maxdist, 0.0));
    IF abs_distance > abs_maxdist THEN RETURN 0.0; END IF;
    IF abs_distance <= 0.0 THEN RETURN 1.0 * abs_weight; END IF;
    RETURN abs_weight * (
            1.0 / |/ (2.0 * PI())
          ) * POWER(EXP(-1 * (3.0 * abs_distance / abs_maxdist)), 2)
          / 0.4;
  END;
$$;

CREATE FUNCTION _gaussian_statetransition(
    agg_state FLOAT8, -- carry the state!
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8)
RETURNS FLOAT8
IMMUTABLE
LANGUAGE plpgsql
AS $$
  BEGIN
    RETURN
      agg_state + _gaussian_density(weight, distance, maxdist);
  END;
$$;

CREATE AGGREGATE weighted_density(FLOAT8, FLOAT8, FLOAT8)
(
    sfunc = _gaussian_statetransition,
    stype = FLOAT8,
    initcond = 0
);

I wanted to be able to still use the density calculation function outside of an aggregate, so I decided to add another function for the state transition which in turn employs the function _gaussian_density.

The aggregate then defines the state type and its initial state and we're good to go. In order to correctly handle some edge cases, I adapted _gaussian_density a bit (also towards dealing with NULL values)..

Thanks a lot!

klin · Accepted Answer

The function _gaussian_density() should depend on the value calculated in the previous step. If in your case this is the first argument weight then the initial condition should not be 0, as all next calculations will give zero as a result. I assume that the initial value of weight is 1.0:

DROP AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION);
CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 1.0 -- !!
);

Note that the aggregate does not use the column weight of the table as it is internal-state value, for which only the initial condition should be declared and which is returned as a final result.

SELECT
    idx, time, category,
    weighted_density(distance, 10000) AS wd -- !!
FROM my_table
GROUP BY idx, time, category  
ORDER BY idx, time, category;

 idx |        time         | category |         wd          
-----+---------------------+----------+---------------------
   1 | 2017-01-01 00:00:00 | class_a  |   0.476331421206002
   1 | 2017-01-01 00:00:00 | class_b  |   0.968750868953701
   1 | 2017-01-01 00:13:00 | class_a  |    0.69665860026144
   1 | 2017-01-01 00:13:00 | class_b  |   0.952383202706387
   2 | 2017-01-01 00:00:00 | class_a  | 0.00519142111518706
   2 | 2017-01-27 21:07:00 | class_b  |   0.893107967346503
(6 rows)

I am not sure I have correctly read your intentions, however my remarks should put you on the right path.

Custom Aggregate in postgres using multiple columns

Disclaimer: Solution below

What's the question?

Disclaimer: Solution below

What have I tried so far?

SOLUTION

Answers (1)

Related Questions