Donald Miner
Donald Miner

Reputation: 39903

Most compact and fastest way to store my string in PostgreSQL

I have a large hexidecimal (16 byte, 32 hex digits) data item that always has the format:

00d980113901429fa6de7fb7e2da705a

This is coming in as an ASCII string from my source (i.e., the zero above is character zero 0x30, not 0x00), and I would like to know peoples' opinions on the best way (irt storage and speed) to store this in PostgreSQL.

The obvious thing to do is to just store it as a varchar, but storing it in a binary form would definitely save space. Would I see performance gains from select and insert by storing it in a binary form? Would bytea or bit be better? Is there a difference between these two in terms of internal representation?

Another idea would be to store it as two bigint/int8 or four integer/int4, split up into multiple columns.

Space and time are an issue as I have MANY of these (upwards of a trillion).

Upvotes: 3

Views: 2072

Answers (3)

Julius Musseau
Julius Musseau

Reputation: 4155

I suspect BYTEA will be 2x smaller for space, and 2x faster for comparisons (>, <, =) compared to a VARCHAR representation.

In other database engines you can even avoid the length-header overhead. For example:

MS-SQL:   BINARY(16)
Oracle:   RAW(16)
MySQL:    BINARY(16)

Or if you like length-headers:

MS-SQL:   VARBINARY(16)
Oracle:   BLOB
MySQL:    VARBINARY(16)

PostgreSQL only supports BYTEA, so you always pay for the length-header, but I still go with BYTEA in this situation.

Upvotes: 1

Tometzky
Tometzky

Reputation: 23890

Compare these two tables of 10M records:

create table test (a int8 not null, b int8 not null, primary key(a,b));
insert into test
  select generate_series(1,10000000), generate_series(1,10000000);
select pg_size_pretty(pg_total_relation_size('test'));
723 MB
create table test_bytea (a bytea not null);
insert into test_bytea
  select decode(lpad(to_hex(a),16,'0')||lpad(to_hex(b),16,'0'),'hex') from test;
alter table test_bytea add primary key (a);
select pg_size_pretty(pg_total_relation_size('test_bytea'));
804 MB

A bytea with index is 11% bigger than 2*int8. This isn't much, but it means that 11% less rows will be in cache. And sequentional scans will be 11% slower etc.

If your data does not change maybe you should consider a flat file storage of sorted values instead of database - this will be only 152MB per 10M records and searching will be O(log(n)).

Upvotes: 3

OMG Ponies
OMG Ponies

Reputation: 332591

You have to determine what the most common use for the data is, in order to determine the appropriate data type. A conversion away from the data type means that an index referencing the column is useless.

Upvotes: 1

Related Questions