Reputation: 1560
For various reasons that don't matter here we are storing chunks of text in either an encrypted or base64 encoded format in PostgreSQL. However, we want to be able to use PostgreSQL's fulltext search to find and return data which in its unencrypted/decoded form matches a search query.
How would one go about accomplishing this? I've seen other posts mention the ability to build the tsvector values before sending data to the database, but I was hoping there would be something available on the Postgres end of things (at least for the base64 text).
Upvotes: 5
Views: 2980
Reputation: 324275
For encrypted values you can't. Even if you created the tsvector
client-side, the tsvector would contain a form of the encrypted text so it wouldn't be acceptable for most applications. Observe:
regress=> SELECT to_tsvector('my secret password is CandyStrip3r');
to_tsvector
------------------------------------------
'candystrip3r':5 'password':3 'secret':2
(1 row)
... whoops. It doesn't matter if you create that value client side instead of using to_tsvector
, it'll still have your password in cleartext. You could encrypt the tsvector, but then you couldn't use it for fulltext seach.
Sure, given the encrypted value:
CREATE EXTENSION pgcrypto;
regress=> SELECT encrypt( convert_to('my s3kritPassw1rd','utf-8'), '\xdeadbeef', 'aes');
encrypt
--------------------------------------------------------------------
\x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff
(1 row)
you can (but shouldn't) do something like this:
regress=> SELECT to_tsvector( convert_from(decrypt('\x10441717bfc843677d2b76ac357a55ac5566ffe737105332552f98c2338480ff', '\xdeadbeef', 'aes'), 'utf-8') );
to_tsvector
--------------------
's3kritpassw1rd':2
(1 row)
... but if the problems with that aren't immediately obvious after scrolling right in the code display box then you should really be getting somebody else to do your security design for you ;-)
There's been tons of research on ways to perform operations on encrypted values without decrypting them, like adding two encrypted numbers together to produce a result that's encrypted with the same key, so the process doing the adding doesn't need the ability to decrypt the inputs in order to get the output. It's possible some of this could be applied to fts - but it's way beyond my level of expertise in the area and likely to be horribly inefficient and/or cryptographically weak anyway.
For base64 you decode
the base64 before feeding it into to_tsvector
. Because decode
returns a bytea
and you know the encoded data is text you need to use convert_from
to decode the bytea
into text
in the database encoding, eg:
regress=> SELECT encode(convert_to('some text to search','utf-8'), 'base64');
encode
------------------------------
c29tZSB0ZXh0IHRvIHNlYXJjaA==
(1 row)
regress=> SELECT to_tsvector(convert_from( decode('c29tZSB0ZXh0IHRvIHNlYXJjaA==', 'base64'), getdatabaseencoding() ));
to_tsvector
---------------------
'search':4 'text':2
(1 row)
In this case I've used the database encoding as the input to convert_from
, but you need to make sure you use the encoding that the underlying base64
encoded text was in. Your application is responsible for getting this right. I suggest either storing the encoding in a 2nd column or ensuring that your application always encodes the text as utf-8
before applying base64
encoding.
Upvotes: 7