Reputation: 91
We'd like to use PostgreSQL to store documents. Since some of them may be up to 2 GB in size, we have to use the large object (lob) data type, where the large objects are stored in a separate table (pg_largeobject) and referenced by an OID, as per the docs on large objects.
It is very important for us that these documents (.pdf, .doc, ...) can be searched. The built-in PostgreSQL full text search can search tables and columns, but is it possible to search the large objects in the pg_largeobject table as well?
If not, we have to use Oracle.
Upvotes: 9
Views: 3948
Reputation: 116187
In PostgreSQL, full text search is mostly based on one very important function: to_tsvector().
This function converts a text document into the tsvector data type. Typically, a tsvector is about 1% of the size of the original text.
All further text search methods are based on creating indexes on top of computed or stored tsvector values. With this in mind, you can compute the necessary tsvector values anywhere (even on a client!) and save them in specially created columns for indexing purposes. In other words, you don't really have to store your original texts in the database: the tsvector is all you need to tell which document contains the text you are searching for (though it won't tell you the location within a single document).
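A minimal sketch of that pattern (the table and column names here are hypothetical, and the text extraction itself is assumed to happen outside the database):

    -- Hypothetical schema: only the tsvector of the extracted text is stored,
    -- alongside whatever identifies the original file (e.g. its OID).
    CREATE TABLE doc_search (
        doc_oid  oid PRIMARY KEY,   -- reference to the large object
        doc_name text,
        tsv      tsvector           -- computed from the extracted text
    );

    -- Index the tsvector column so searches don't scan the whole table.
    CREATE INDEX doc_search_tsv_idx ON doc_search USING gin (tsv);

    -- Store a precomputed tsvector (the extracted text itself is not kept;
    -- 12345 is just a placeholder OID).
    INSERT INTO doc_search (doc_oid, doc_name, tsv)
    VALUES (12345, 'report.pdf',
            to_tsvector('english', 'extracted text of the PDF goes here'));

    -- Find documents matching a query.
    SELECT doc_oid, doc_name
    FROM doc_search
    WHERE tsv @@ to_tsquery('english', 'invoice & total');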
What's cool about this is that the indexed document can be anything - Word, Excel, PDF, whatever - as long as you can extract the textual content of that document and compute the tsvector to be stored and indexed.
One small downside of computing tsvectors this way (especially on the client side) is that you have to recompute them every time you change the original document.
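With the hypothetical doc_search table sketched above, that recomputation is just an UPDATE of the stored tsvector after re-extracting the text outside the database:

    -- Refresh the stored tsvector for a changed document.
    UPDATE doc_search
    SET tsv = to_tsvector('english', 'freshly extracted text of the new version')
    WHERE doc_oid = 12345;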
Upvotes: 3
Reputation: 324475
There are at least two issues here.
Full-text search doesn't really work on large objects stored as lob or oid references. You cannot full-text index the contents of pg_largeobject.
Full-text search is an indexing system for text. It cannot index PDF, Microsoft Word documents, or other random binary files. It does not have provision for text-extraction tool callbacks, etc.
You can either:
Create a table that contains the text extracted from those files with external tools, along with an oid that refers to the file itself, then full-text index that table of extracted text (see the sketch after this list); or
Use a more powerful, full-featured external search system like Solr (based on Lucene) that's designed to cope with varying formats, do its own text extraction, etc.
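A rough sketch of the first option, assuming hypothetical names and that an external tool produces the plain text before the row is written:

    -- The extracted plain text lives next to the OID of the original file.
    CREATE TABLE document_text (
        file_oid   oid PRIMARY KEY,  -- points at the large object holding the .pdf/.doc
        plain_text text              -- output of the external extraction tool
    );

    -- Expression index, so a tsvector column never has to be stored explicitly.
    CREATE INDEX document_text_fts_idx
        ON document_text
        USING gin (to_tsvector('english', plain_text));

    -- Search the extracted text; join back to the large object via file_oid as needed.
    SELECT file_oid
    FROM document_text
    WHERE to_tsvector('english', plain_text) @@ to_tsquery('english', 'contract');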
Upvotes: 3