Encryption using Python and PostgreSQL

Question

We have a database that contains personally-identifying information (PII) that needs to be encrypted.

From the Python side, I can use PyCrypto to encrypt data using AES-256 and a variable salt; this results in a Base64 encoded string.

From the PostgreSQL side, I can use the PgCrypto functions to encrypt data in the same way, but this results in a bytea value.

For the life of me, I can't find a way to convert between these two, or to make a comparison between the two so that I can do a query on the encrypted data. Any suggestions/ideas?

Note: yes, I realize that I could do all the encryption/decryption on the database side, but my goal is to ensure that any data transmitted between the application and the database still does not contain any of the PII, as it could, in theory, be vulnerable to interception, or visible via logging.

Craig Ringer · Accepted Answer

Imagine you have a Social Security Number field in your table. Users must be able to query for a particular SSN when needed. The SSN, obviously, needs to be encrypted. I can encrypt it from the Python side and save it to the database, but then in order for it to be searchable, I would have to use the same salt for every record so that I can incorporate the encrypted value as part of my WHERE clause, and that just leaves us vulnerable. I can encrypt/decrypt on the database side, but in that case, I'm sending the SSN in plain-text whenever I'm querying, which is also bad.

The usual solution to this kind of issue is to store a partial value, hashed unsalted or with a fixed salt, alongside the randomly salted full value. You index the hashed partial value and search on that. You'll get false-positive matches, but still significantly benefit from DB-side indexed searching. You can fetch all the matches and, application-side, discard the false positives.

Querying encrypted data is all about compromises between security and performance. There's no magic answer that'll let you send a hashed value to the server and have it compare it to a bunch of randomly salted and hashed values for a match. In fact, that's exactly why we salt our hashes - to prevent that from working, because that's also pretty much what an attacker does when trying to brute-force.

So. Compromise. Either live with sending the SSNs as plaintext (over SSL) for comparison to salted & hashed stored values, knowing that it still greatly reduces exposure because the whole lot can't be dumped at once. Or index a partial value and search on that.

Do be aware that another problem with sending values unhashed is that they can appear in the server error logs. Even if you don't have log_statement = all, they may still appear if there's an error, like query cancellation or a deadlock break. Sending the values as query parameters reduces the number of places they can appear in the logs, but is far from foolproof. So if you send values in the clear you've got to treat your logs as security critical. Fun!

Encryption using Python and PostgreSQL

Answers (2)

Related Questions