fommil
fommil

Reputation: 5885

Is there a postgres function to mutably update a binary data structure?

I have written an AGGREGATE function that approximates a SELECT COUNT(DISTINCT ...) over a UUID column, a kind of poor man's HyperLogLog (and having different perf characteristics).

However, it is very slow because I am using set_bit on a BIT and that has copy-on-write semantics.

So my question is:

  1. is there a way to inplace / mutably update a BIT or bytea?
  2. failing that, are there any binary data structures that allow mutable/in-place set_bit edits?

A constraint is that I can't push C code or extensions to implement this. But I can use extensions that are available in AWS RDS postgres. If it's not faster than HLL then I'll just be using HLL. Note that HLL is optimised for pre-aggregated counts, it isn't terribly fast at doing adhoc count estimates over millions of rows (although still faster than a raw COUNT DISTINCT).

Below is the code for context, probably buggy too:

CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_sfunc (BIT(83886080), UUID)
RETURNS BIT(83886080) AS $$
DECLARE
   s BIT(83886080) := $1;
BEGIN
   IF s IS NULL THEN s := '0' :: BIT(83886080); END IF;
   RETURN set_bit(s, abs(mod(uuid_hash($2), 83886080)), 1);
END
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_ffunc (BIT(83886080))
RETURNS INTEGER AS $$
DECLARE
   i INTEGER := 83886079;
   s INTEGER := 0;
BEGIN
   LOOP
     EXIT WHEN i < 0;
     IF get_bit($1, i) = 1 THEN s := s + 1; END IF;
     i := i - 1;
   END LOOP;
   RETURN s;
END
$$ LANGUAGE plpgsql;

CREATE OR REPLACE AGGREGATE approx_count_distinct (UUID) (
  SFUNC = uuid_approx_count_distinct_sfunc,
  FINALFUNC = uuid_approx_count_distinct_ffunc,
  STYPE = BIT(83886080),
  COMBINEFUNC = bitor,
  PARALLEL = SAFE
);

Upvotes: 3

Views: 342

Answers (1)

asah
asah

Reputation: 46

Yeah, SQL isn't actually that fast for raw computation. I might try a UDF, perhaps pljava or plv8 (JavaScript) which compile just-in-time to native and available on most major hosting providers. Of course for performance, use C (perhaps via LLVM) for maximum performance at maximum pain. Plv8 should take minutes to prototype, just pass an array constructed from array_agg(). Obviously keep the array size to millions of items, or find a way to roll-up your sketches ( bitwuse-AND ?) https://plv8.github.io/#function-calls https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-array_agg-function/

FYI HyperLogLog is available as an open source extension for PostgreSQL from Citus/Microsoft and of course available on Azure. https://www.google.com/search?q=hyperloglog+postgres (You could crib from their coffee and just change the core algorithm, then test side by side). Citus is pretty easy to install, so this isn't a bad option.

Upvotes: 2

Related Questions