vera
vera

Reputation: 91

Postgresql: Full text search within lob possible?

We'd like to use PostgreSQL to store documents. As some of them might be up to 2 GB big, we have to use the lob-Datatype, where the large objects are stored in a separate table (pg_largeobject), referenced by an OID, as per the docs on large objects.

For us it's very important that these documents (.pdf, .doc, ...) can be searched. With the built-in PostgreSQL full text search tables and columns can be searched, but is it possible to search the large-objects in the pg_largeobject table as well?

If not, we have to use Oracle.

Upvotes: 9

Views: 3948

Answers (2)

mvp
mvp

Reputation: 116187

In PostgreSQL, full text search is mostly based on one very important function to_tsvector(). This function converts text document into tsvector data type. Typically, tsvector physical size is about 1% of original text.

All further text search methods are based on creating indexes on top of computed or physical tsvector values. With this in mind, you can compute necessary tsvector values anywhere (even on a client!) and save it in specially created column(s) for indexing purposes. In other words, you don't really have to store your original texts in database - tsvector is all you need to tell which document contained text you are searching for (but it won't tell you location within single document). What's cool about this, is that indexed document can be anything - even Word, Excel, PDF, whatever, as long as you can get textual content of that document and compute necessary tsvector to be stored and indexed.

One small downside of recomputing tsvector's (especially on client side) is that you have to recompute it every time you change your original document.

Upvotes: 3

Craig Ringer
Craig Ringer

Reputation: 324475

There are at least two issues here.

  1. Full-text search doesn't really work on large objects stored as lob or oid-references. You cannot full-text index the contents of pg_largeobject.

  2. Full-text search is an indexing system for text. It cannot index PDF, Microsoft Word documents, or other random binary files. It does not have provision for text-extraction tool callbacks, etc.

You can either:

  • Create a table that contains text extracted from those files using external tools along with an oid that refers to the file its self, then full-text index that table of extracted text; or

  • Use a more powerful, full-featured external search system like Solr (based on Lucene) that's designed to cope with varying formats, do its own text extraction, etc.

Upvotes: 3

Related Questions