user1265125

Reputation: 2656

Storage issue with key/value datatypes, particularly hstore in PostgreSQL

Say I have a table with 3 columns: varchar(20), hstore, smallint

Now if I insert the following: "ABCDEF", "abc=>123, xyz=>888, lmn=>102", 5

How much space will the record take in PostgreSQL? Is the hstore stored as plain text? So if I have a million records, will the space taken by the keys (abc, xyz, lmn) be duplicated across every record?

I'm asking this because I have a use case in which I need to store an unknown number of key/value pairs, with each key taking up to 20 bytes and each value within smallint range.

The catch is that the number of records is massive: around 90 million a day, with ~400 key/value pairs per record. This quickly leads to storage problems, since just a day's data would total around 800 GB, with a large percentage taken up by the keys, which are duplicated across all records.

So considering there are 400 key/value pairs, a single hstore in a record (if stored as plain text) would take 400 × 22 bytes = 8,800 bytes. Multiplied by 90 million records, that is about 737 GiB. If the values were instead stored in normal columns as 2-byte smallints, they would take just 800 bytes per record, about 67 GiB in total.

Are hstores not suitable for this use case? Do I have any option that can help with this storage issue? I know this is a big ask, and I may just have to go with a regular columnar storage solution and forgo the flexibility offered by key/value storage.

Upvotes: 1

Views: 597

Answers (1)

Craig Ringer

Reputation: 324475

How much space will the record take in PostgreSQL?

To get the raw uncompressed size:

SELECT pg_column_size( ROW( 'ABCDEF', 'abc=>123, xyz=>888, lmn=>102'::hstore, 5) );

but due to TOAST (compressed, out-of-line storage) that might not be the on-disk size, though for small values like this it usually is:

CREATE TABLE blah(col1 text, col2 hstore, col3 integer);

INSERT INTO blah (col1, col2, col3)
VALUES ('ABCDEF', 'abc=>123, xyz=>888, lmn=>102'::hstore, 5);

regress=> SELECT pg_column_size(blah) FROM blah;
 pg_column_size 
----------------
             84
(1 row)

If you used a bigger hstore value here it might get compressed and stored out of line. In that case, the size would depend on how compressible it is.
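For example, here's a quick sketch (the table name toast_demo is made up) showing compression kick in for a large, highly compressible value:

CREATE TABLE toast_demo(hs hstore);

INSERT INTO toast_demo(hs)
VALUES (hstore(ARRAY['bigkey'], ARRAY[repeat('x', 10000)]));

-- pg_column_size reports the stored (possibly compressed) size,
-- which should come out far smaller than the ~10 kB raw value:
SELECT pg_column_size(hs) FROM toast_demo;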

Is the hstore stored as plain text?

No, it's a binary format, but it isn't compressed either; the keys and values within it are stored as plain text.
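You can see the binary format's overhead by comparing the stored size against the length of the text representation (illustrative only; exact numbers vary):

SELECT pg_column_size('abc=>123, xyz=>888, lmn=>102'::hstore) AS hstore_bytes,
       length('abc=>123, xyz=>888, lmn=>102') AS text_chars;

-- hstore_bytes comes out somewhat larger than text_chars, since the binary
-- format stores a pair count and per-entry offsets alongside the raw strings.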

So if I have a million records, will the space taken by the keys (abc, xyz, lmn) be duplicated across every record?

Correct. Each hstore value is a standalone value. It has no relationship with any other value anywhere in the system. It's just like a text or json or whatever else. There's no sort of central key index or anything like that.

Demo:

CREATE TABLE hsdemo(hs hstore);

INSERT INTO hsdemo(hs) 
SELECT hstore(ARRAY['thisisthefirstkey', 'thisisanotherbigkey'], ARRAY[x::text, x::text]) 
FROM generate_series(1,10000) x;

SELECT pg_size_pretty(pg_relation_size('hsdemo'::regclass));
-- prints 992 kB

INSERT INTO hsdemo(hs) 
SELECT hstore(ARRAY['thisisthefirstkey', 'thisisanotherbigkey'], ARRAY[x::text, x::text]) 
FROM generate_series(10001,20000) x;

SELECT pg_size_pretty(pg_relation_size('hsdemo'::regclass));
-- prints 1968 kB, i.e. nearly double the size for double the records.

Thus, if you have many highly duplicated large keys and small values, you should probably look at a normalized schema (yes, even EAV).
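For illustration, a minimal EAV-style sketch (all table and column names here are made up):

CREATE TABLE attr_names (
    key_id   smallint PRIMARY KEY,
    key_name varchar(20) NOT NULL UNIQUE
);

CREATE TABLE record_attrs (
    record_id bigint   NOT NULL,
    key_id    smallint NOT NULL REFERENCES attr_names,
    val       smallint NOT NULL,
    PRIMARY KEY (record_id, key_id)
);

-- Each key name is stored exactly once in attr_names; each record/key pair
-- costs only the fixed-width columns plus the per-row overhead.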

However, be aware that PostgreSQL has quite a large per-row overhead, over 20 bytes per row, so you may not gain as much as you'd expect by storing huge numbers of short rows instead of something like an hstore.
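You can measure that overhead directly; a rough sketch (table name made up):

CREATE TABLE overhead_demo(val smallint);

INSERT INTO overhead_demo
SELECT 1 FROM generate_series(1, 100000);

-- Total on-disk size divided by row count: expect roughly 30-40 bytes
-- per row, even though each row carries only 2 bytes of actual data.
SELECT pg_relation_size('overhead_demo') / 100000.0 AS bytes_per_row;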

You can always compromise: keep a lookup table of full key names and associate each with a short hstore key, so your app effectively compresses the keys in each hstore.
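A rough sketch of that compromise (all names here are hypothetical):

CREATE TABLE key_names (
    short_key text PRIMARY KEY,          -- key actually stored in the hstore
    full_name varchar(20) NOT NULL UNIQUE
);

INSERT INTO key_names VALUES ('k1', 'abc'), ('k2', 'xyz'), ('k3', 'lmn');

CREATE TABLE records(hs hstore);

-- The app writes short keys instead of the full names...
INSERT INTO records VALUES ('k1=>123, k2=>888, k3=>102'::hstore);

-- ...and expands them back on read:
SELECT kn.full_name, r.hs -> kn.short_key AS val
FROM records r
CROSS JOIN key_names kn
WHERE r.hs ? kn.short_key;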

Upvotes: 2
