jbndlr
jbndlr

Reputation: 5210

Custom Aggregate in postgres using multiple columns

Disclaimer: Solution below

I have a set of records from a query with columns (idx, time, category, weight, distance):

The rows can look like the following:

(1, '2017-01-01 00:00', 'class_a', 1, 234.5)
(1, '2017-01-01 00:00', 'class_a', 1, 987.1)
(1, '2017-01-01 00:00', 'class_a', 1, 1.23)
(1, '2017-01-01 00:00', 'class_b', 1, 48.5)
(2, '2017-01-01 00:00', 'class_a', 1, 8763.5)
(1, '2017-01-01 00:13', 'class_a', 1, 598.02)
(1, '2017-01-01 00:13', 'class_b', 1, 76.9)
...
(2, '2017-01-27 21:07', 'class_b', 1, 184.0)

What's the question?

I'm seeking for a solution to calculate a custom aggregate on such data, but I can hardly find any instructions or examples on how this is actually done (hopefully possible without writing a C extension to postgres).

SELECT
  idx, time, category,
  weighted_density(
    value, distance, 10000.0 -- arbitrary 10k is explained below
  ) AS wd
FROM (my rows as shown above)
GROUP BY
  idx, time, category

I feel like setting up a custom aggregate (named WEIGHTED_DENSITY here) should be the right way to achieve something like the outlined query. My aim is to end up with a result set where the compound (idx, time, category) is unique, and its wd is calculated using all weight and distance values from the related rows.

Disclaimer: Solution below

What have I tried so far?

First, I was getting the entire rows from the database and calculated the aggregate off-line using another program and language (python). But this is quite resource consuming and should be run on the database server instead of the local machine (also to ensure integrity).

Then, I tried setting up a postgres function to calculate the result value using a single row:

CREATE OR REPLACE FUNCTION _gaussian_density(
    IN DOUBLE PRECISION, -- the weight
    IN DOUBLE PRECISION, -- the distance
    IN DOUBLE PRECISION  -- the maximum distance
  ) RETURNS DOUBLE PRECISION AS
$BODY$
BEGIN
  -- calculate weighted density, using max distance;
  -- this calculation itself doesn't really matter; it's some sort
  -- of density using a cropped gaussian kernel, for those who ask.
  RETURN
    CASE
      WHEN ABS($2) > ABS($3) THEN 0.0
      WHEN ABS($2) <= 0.0 THEN 1.0
      ELSE
        $1 * (
          1.0 / |/ (2.0 * PI())
        ) * POWER(EXP(-1 * (3.0 * ABS($2) / ABS($3))), 2)
        / 0.4
    END;
END
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 10;

Also, to make the function available as an aggregate, I tried

CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 0.0
);

But that's where I'm stuck, I just can't get it right and it seems I need an example or a little bit of a hint that pushes me into the right direction on how custom aggregates are properly created and made use of.

Cheers to you guys and thanks a lot in advance!

SOLUTION

Thanks @klin for pointing out that I missed to carry the aggregate state along. Now, this finally works:

CREATE FUNCTION _gaussian_density(
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8
  )
RETURNS FLOAT8
IMMUTABLE
CALLED ON NULL INPUT
LANGUAGE plpgsql
AS $$
  DECLARE
    abs_weight FLOAT8;
    abs_distance FLOAT8;
    abs_maxdist FLOAT8;
    dist_weight FLOAT8;
  BEGIN
    -- calculate weighted density, using max distance;
    -- this calculation itself doesn't really matter; it's some sort
    -- of density using a cropped gaussian kernel, for the curious
    abs_weight := ABS(COALESCE(weight, 1.0));
    abs_distance := ABS(COALESCE(distance, 0.0));
    abs_maxdist := ABS(COALESCE(maxdist, 0.0));
    IF abs_distance > abs_maxdist THEN RETURN 0.0; END IF;
    IF abs_distance <= 0.0 THEN RETURN 1.0 * abs_weight; END IF;
    RETURN abs_weight * (
            1.0 / |/ (2.0 * PI())
          ) * POWER(EXP(-1 * (3.0 * abs_distance / abs_maxdist)), 2)
          / 0.4;
  END;
$$;

CREATE FUNCTION _gaussian_statetransition(
    agg_state FLOAT8, -- carry the state!
    weight FLOAT8,
    distance FLOAT8,
    maxdist FLOAT8)
RETURNS FLOAT8
IMMUTABLE
LANGUAGE plpgsql
AS $$
  BEGIN
    RETURN
      agg_state + _gaussian_density(weight, distance, maxdist);
  END;
$$;

CREATE AGGREGATE weighted_density(FLOAT8, FLOAT8, FLOAT8)
(
    sfunc = _gaussian_statetransition,
    stype = FLOAT8,
    initcond = 0
);

I wanted to be able to still use the density calculation function outside of an aggregate, so I decided to add another function for the state transition which in turn employs the function _gaussian_density.

The aggregate then defines the state type and its initial state and we're good to go. In order to correctly handle some edge cases, I adapted _gaussian_density a bit (also towards dealing with NULL values)..

Thanks a lot!

Upvotes: 0

Views: 1494

Answers (1)

klin
klin

Reputation: 121754

The function _gaussian_density() should depend on the value calculated in the previous step. If in your case this is the first argument weight then the initial condition should not be 0, as all next calculations will give zero as a result. I assume that the initial value of weight is 1.0:

DROP AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION);
CREATE AGGREGATE weighted_density(DOUBLE PRECISION, DOUBLE PRECISION)
(
    sfunc = _gaussian_density,
    stype = DOUBLE PRECISION,
    initcond = 1.0 -- !!
);

Note that the aggregate does not use the column weight of the table as it is internal-state value, for which only the initial condition should be declared and which is returned as a final result.

SELECT
    idx, time, category,
    weighted_density(distance, 10000) AS wd -- !!
FROM my_table
GROUP BY idx, time, category  
ORDER BY idx, time, category;

 idx |        time         | category |         wd          
-----+---------------------+----------+---------------------
   1 | 2017-01-01 00:00:00 | class_a  |   0.476331421206002
   1 | 2017-01-01 00:00:00 | class_b  |   0.968750868953701
   1 | 2017-01-01 00:13:00 | class_a  |    0.69665860026144
   1 | 2017-01-01 00:13:00 | class_b  |   0.952383202706387
   2 | 2017-01-01 00:00:00 | class_a  | 0.00519142111518706
   2 | 2017-01-27 21:07:00 | class_b  |   0.893107967346503
(6 rows)    

I am not sure I have correctly read your intentions, however my remarks should put you on the right path.

Upvotes: 2

Related Questions