Reputation: 39903
I have a large hexadecimal (16-byte, 32 hex digit) data item that always has the format:
00d980113901429fa6de7fb7e2da705a
This is coming in as an ASCII string from my source (i.e., the zero above is the character zero, 0x30, not the byte 0x00), and I would like to know people's opinions on the best way, with regard to storage and speed, to store this in PostgreSQL.
The obvious thing to do is to just store it as a varchar, but storing it in a binary form would definitely save space. Would I see performance gains on SELECT and INSERT by storing it in binary form? Would bytea or bit be better? Is there a difference between the two in terms of internal representation?
Another idea would be to store it as two bigint/int8 columns or four integer/int4 columns, splitting the value across multiple columns.
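For concreteness, here is a minimal sketch of both conversions in PostgreSQL, using the example value above (the ('x' || ...)::bit(64)::bigint cast is just one way to do the split):

-- hex string to bytea
select decode('00d980113901429fa6de7fb7e2da705a', 'hex');

-- hex string to two bigints (high and low 8 bytes); values with the top bit
-- set come out negative, which is harmless for storage and comparison
select ('x' || substring('00d980113901429fa6de7fb7e2da705a' from 1 for 16))::bit(64)::bigint as hi,
       ('x' || substring('00d980113901429fa6de7fb7e2da705a' from 17 for 16))::bit(64)::bigint as lo;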
Space and time are an issue as I have MANY of these (upwards of a trillion).
Upvotes: 3
Views: 2072
Reputation: 4155
I suspect BYTEA will be half the size and roughly twice as fast for comparisons (>, <, =) compared to a VARCHAR representation.
In other database engines you can even avoid the length-header overhead. For example:
MS-SQL: BINARY(16)
Oracle: RAW(16)
MySQL: BINARY(16)
Or if you like length-headers:
MS-SQL: VARBINARY(16)
Oracle: BLOB
MySQL: VARBINARY(16)
PostgreSQL only supports BYTEA, so you always pay for the length-header, but I still go with BYTEA in this situation.
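A minimal sketch of what that looks like in PostgreSQL (the table name is made up for illustration; the value is the question's example):

create table items (id bytea primary key);
insert into items values (decode('00d980113901429fa6de7fb7e2da705a', 'hex'));
-- bytea compares byte-wise, so equality and range predicates can use the index
select * from items where id = decode('00d980113901429fa6de7fb7e2da705a', 'hex');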
Upvotes: 1
Reputation: 23890
Compare these two tables of 10M records:
create table test (a int8 not null, b int8 not null, primary key(a,b));
insert into test
select generate_series(1,10000000), generate_series(1,10000000);
select pg_size_pretty(pg_total_relation_size('test'));
723 MB
create table test_bytea (a bytea not null);
insert into test_bytea
select decode(lpad(to_hex(a),16,'0')||lpad(to_hex(b),16,'0'),'hex') from test;
alter table test_bytea add primary key (a);
select pg_size_pretty(pg_total_relation_size('test_bytea'));
804 MB
A bytea with an index is 11% bigger than 2×int8. This isn't much, but it means that 11% fewer rows will fit in cache, sequential scans will be about 11% slower, and so on.
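For reference, a point lookup against each layout would look something like this (values are arbitrary; with the two-int8 table, both halves need to be in the predicate to use the composite primary key fully):

select * from test where a = 42 and b = 42;
select * from test_bytea
  where a = decode('000000000000002a000000000000002a', 'hex');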
If your data does not change, maybe you should consider flat-file storage of sorted values instead of a database: that's only 152 MB per 10M records (10M × 16 bytes), and searching is O(log(n)) with a binary search.
Upvotes: 3
Reputation: 332591
You have to determine what the most common use of the data is in order to choose the appropriate data type. If queries have to convert the column away from that data type, an index referencing the column becomes useless.
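As a hedged illustration, using the test_bytea table from the benchmark answer above: filtering on an expression over the column keeps the planner from using the plain index on it, while comparing against a literal converted to the column's own type does not.

-- the column is converted away from bytea, so the index on a cannot be used
explain select * from test_bytea where encode(a, 'hex') = '00d980113901429fa6de7fb7e2da705a';
-- the literal is converted to bytea instead, so the index on a can be used
explain select * from test_bytea where a = decode('00d980113901429fa6de7fb7e2da705a', 'hex');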
Upvotes: 1