Reputation: 5885
I have written an AGGREGATE
function that approximates a SELECT COUNT(DISTINCT ...)
over a UUID
column, a kind of poor man's HyperLogLog (and having different perf characteristics).
However, it is very slow because I am using set_bit
on a BIT
and that has copy-on-write semantics.
So my question is:
BIT
or bytea
?set_bit
edits?A constraint is that I can't push C code or extensions to implement this. But I can use extensions that are available in AWS RDS postgres. If it's not faster than HLL then I'll just be using HLL. Note that HLL is optimised for pre-aggregated counts, it isn't terribly fast at doing adhoc count estimates over millions of rows (although still faster than a raw COUNT DISTINCT
).
Below is the code for context, probably buggy too:
CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_sfunc (BIT(83886080), UUID)
RETURNS BIT(83886080) AS $$
DECLARE
s BIT(83886080) := $1;
BEGIN
IF s IS NULL THEN s := '0' :: BIT(83886080); END IF;
RETURN set_bit(s, abs(mod(uuid_hash($2), 83886080)), 1);
END
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION uuid_approx_count_distinct_ffunc (BIT(83886080))
RETURNS INTEGER AS $$
DECLARE
i INTEGER := 83886079;
s INTEGER := 0;
BEGIN
LOOP
EXIT WHEN i < 0;
IF get_bit($1, i) = 1 THEN s := s + 1; END IF;
i := i - 1;
END LOOP;
RETURN s;
END
$$ LANGUAGE plpgsql;
CREATE OR REPLACE AGGREGATE approx_count_distinct (UUID) (
SFUNC = uuid_approx_count_distinct_sfunc,
FINALFUNC = uuid_approx_count_distinct_ffunc,
STYPE = BIT(83886080),
COMBINEFUNC = bitor,
PARALLEL = SAFE
);
Upvotes: 3
Views: 342
Reputation: 46
Yeah, SQL isn't actually that fast for raw computation. I might try a UDF, perhaps pljava or plv8 (JavaScript) which compile just-in-time to native and available on most major hosting providers. Of course for performance, use C (perhaps via LLVM) for maximum performance at maximum pain. Plv8 should take minutes to prototype, just pass an array constructed from array_agg(). Obviously keep the array size to millions of items, or find a way to roll-up your sketches ( bitwuse-AND ?) https://plv8.github.io/#function-calls https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-array_agg-function/
FYI HyperLogLog is available as an open source extension for PostgreSQL from Citus/Microsoft and of course available on Azure. https://www.google.com/search?q=hyperloglog+postgres (You could crib from their coffee and just change the core algorithm, then test side by side). Citus is pretty easy to install, so this isn't a bad option.
Upvotes: 2