KullDox

Reputation: 353

How to generate a hash of the result set in Postgres?

I have two databases for logging stuff, which I want to check for synchronization.

The approved solution is to periodically (let's say hourly) send a SELECT to both, generate a hash of each result set and compare them. If they match, great; otherwise raise an alarm.

Currently I'm doing it with a bash script:

 log_table="SELECT column1, column2, column3 FROM log_table where to_char(timestamp, '$ts_format') = '$tx_moment'";
PSQL="psql -t -q -h $_gp_host -U $_gp_user -d log_schema -c ";
echo "`${PSQL} $tx_fix${log_table} | sort | cksum`";

I would like to do the cksum/hash on the Postgres side, because currently the script downloads the whole result set (which can be 25 MB or more) and generates the checksum on the client side.

Google didn't help.

Any suggestions?

Thanks.

Upvotes: 2

Views: 2711

Answers (3)

Tom McClure

Reputation: 6729

Postgres 14+ has a new aggregate function, bit_xor(integer).

So this might perform better, and it is not vulnerable to the 1 GB limit on concatenated hashes. You can wrap the result in md5() if you don't want to use the raw integer.

SELECT BIT_XOR(HASHTEXT(TEXTIN(RECORD_OUT(x)))) FROM (SELECT * FROM yourtable WHERE ...) x;

One nice thing about XOR is that the order of the rows will not affect the checksum.
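
If you prefer a familiar md5 string over the raw integer, a minimal sketch of that wrapping (same assumed yourtable as above):

-- Cast the XOR-aggregated integer to text and hash it, yielding a
-- 32-character value that is easy to compare between servers.
SELECT md5(BIT_XOR(HASHTEXT(TEXTIN(RECORD_OUT(x))))::text)
FROM (SELECT * FROM yourtable WHERE ...) x;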

For earlier versions of Postgres you can define BIT_XOR as a custom aggregate:

CREATE AGGREGATE BIT_XOR(IN v smallint) (SFUNC = int2xor, STYPE = smallint);
CREATE AGGREGATE BIT_XOR(IN v int4) (SFUNC = int4xor, STYPE = int4);
CREATE AGGREGATE BIT_XOR(IN v bigint) (SFUNC = int8xor, STYPE = bigint);
CREATE AGGREGATE BIT_XOR(IN v bit) (SFUNC = bitxor, STYPE = bit);
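
Since hashtext() returns a plain integer, the int4 variant is the one the query above ends up using on pre-14 servers.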

Upvotes: 0

Magnus Hagander

Reputation: 25098

If you want to do a hash over all of it at once, that's going to use up a lot of memory server-side as well. And once you hit 1 GB it won't work at all, since a single string can't be longer than that.

Perhaps something like this will work: it hashes each row, then hashes the concatenation of those hashes. It will still break when the combined length of the hashes goes above 1 GB; you'd need to write a custom md5 aggregate to get around that.

SELECT md5(concat(md5(column1 || column2 || column3))) FROM log_table WHERE ...

This requires that you have created the custom aggregate concat like this:

CREATE AGGREGATE concat (
    BASETYPE = text,
    SFUNC = textcat,
    STYPE = text,
    INITCOND = ''
);
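
Note that, unlike the XOR approach above, concatenation is order-sensitive, so both databases must feed the rows to the aggregate in the same order. A sketch that pins the order down, assuming PostgreSQL 9.0+ where aggregate calls accept an ORDER BY clause (column names taken from the question):

-- Hash each row, concatenate the row hashes in a fixed order,
-- then hash the concatenation into a single comparison value.
SELECT md5(concat(md5(column1 || column2 || column3)
                  ORDER BY column1, column2, column3))
FROM log_table
WHERE ...;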

Upvotes: 1

Frank Heikens

Reputation: 127116

You could use md5:

 log_table="
SELECT 
  md5(column1 || column2 || column3) AS hash,
  column1, column2, column3
FROM log_table where to_char(timestamp, '$ts_format') = '$tx_moment'";
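
This still returns every row, though. If only the comparison matters, a variant that ships just the per-row hashes (a sketch, using the question's columns and filter) keeps the transfer to 32 characters per row:

-- Return only the hash column; the bash pipeline can still
-- sort | cksum it, but far less data crosses the wire.
SELECT md5(column1 || column2 || column3) AS hash
FROM log_table
WHERE to_char(timestamp, '$ts_format') = '$tx_moment';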

Upvotes: 1
