Reputation: 611
I need to store large files (from several MB up to 1 GB) in a Postgres database. The database has multiple schemas. It looks like Postgres has two options for storing large objects: LOB and BYTEA. However, we seem to hit problems with each of these options.
LOB. This works almost ideally: it can store up to 2 GB and allows streaming, so we do not hit memory limits in our PHP backend when reading the LOB. However, all blobs are stored in pg_catalog and are not part of the schema. This leads to a big problem when you try to use pg_dump with the options -n and -b to dump just one schema with its blobs. It dumps the schema data correctly, but it then includes ALL blobs in the database, not just the blobs that belong to that particular schema. Is there a way to dump a single schema with its blobs using pg_dump or some other utility?
BYTEA. These are correctly stored per schema, so pg_dump -n works correctly; however, I cannot seem to find a way to stream the data. This means there is no way to access the data from PHP if it is larger than the memory limit.
Is there any other way to store large data in Postgres that allows streaming and works correctly with multiple schemas per database?
Thanks.
Upvotes: 2
Views: 1605
Reputation: 61606
The pg_catalog.pg_largeobject system table, where the large objects are actually stored, is essentially a list of per-object bytea chunks ordered by pageno, a sequential chunk number from 0 to N.
Table "pg_catalog.pg_largeobject" Column | Type | Modifiers --------+---------+----------- loid | oid | not null pageno | integer | not null data | bytea | Indexes: "pg_largeobject_loid_pn_index" UNIQUE, btree (loid, pageno)
The maximum size of these chunks is 2048 bytes (it can be changed but at the cost of a server recompilation), which is quite small for blobs of several hundred megabytes.
So one option in your case would be to replicate a similar structure in your own schema, probably with larger chunks, and implement stream-like access by iterating over the pageno values in order. Having smaller column contents is better in general, anyway. For example, it's not obvious that pg_dump deals nicely with large bytea contents in a single row in terms of client-side memory requirements.
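A minimal sketch of that layout, assuming hypothetical names (myschema, file, file_chunk) and an arbitrary chunk size chosen by the application:

-- Hypothetical chunked layout mirroring pg_largeobject, but in your own schema.
-- The names and the chunk size are placeholders, not anything Postgres mandates.
create table myschema.file (
    file_id  serial primary key,
    filename text not null
);

create table myschema.file_chunk (
    file_id integer not null references myschema.file(file_id),
    pageno  integer not null,        -- sequential chunk number, 0..N
    data    bytea   not null,        -- one application-sized chunk per row
    primary key (file_id, pageno)
);

-- "Streaming" a file then means reading (or writing) its chunks in order,
-- one or a few rows at a time:
select pageno, data
from myschema.file_chunk
where file_id = 1
order by pageno;

Since these tables live in an ordinary schema, pg_dump -n picks them up together with the rest of that schema.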
Upvotes: 1
Reputation: 133662
Although bytea doesn't give you a streaming/file-style API, you can use it to fetch only parts of the content, so it does support "chunking". You need to set the storage mode of your bytea column to 'external' to disable compression, and then you can use substring on the bytea column to fetch only a part of it. At least according to the documentation, this will DTRT and efficiently access only the requisite part of the value on the database side:
http://www.postgresql.org/docs/current/static/storage-toast.html
So create a schema a bit like this:
create table media.entity(entity_id serial primary key, content bytea not null);
alter table media.entity alter column content set storage external;
And then to fetch 8 KB of the content:
select substring(content, 1, 8192) from media.entity where entity_id = 1;
select substring(content, 8193, 8192) from media.entity where entity_id = 1;
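To walk through the whole value from the client, the same query can be parameterized on a chunk index (a sketch; $1 is a zero-based chunk number bound by the client, and 8192 is the chunk size from the examples above):

-- $1 = zero-based chunk index supplied by the client.
-- Stop looping when the returned value is shorter than 8192 bytes (or empty).
select substring(content, ($1 * 8192) + 1, 8192)
from media.entity
where entity_id = 1;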
Unfortunately, fetches of TOAST data don't seem to be counted in the explain (buffers on) counts, so it's hard to verify that the db is doing what the documentation says.
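One thing you can at least check is that the storage-mode change took effect, for example by querying pg_attribute ('e' is the code for external storage; the table and column names are the ones from the example above):

-- attstorage = 'e' means the column uses external (uncompressed, out-of-line) storage.
select attname, attstorage
from pg_attribute
where attrelid = 'media.entity'::regclass
  and attname = 'content';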
Upvotes: 3