Marcus Junius Brutus
Marcus Junius Brutus

Reputation: 27286

String interning in the database

I am using a PostgreSQL database to log various aspects of a long-running system I have and one particular item I am logging is the CLASSPATH of the various sub-ordinate processes the system spawns .

Each of these CLASSPATHs can be a good one thousand characters, maybe more, and the system can fire up many thousand subordinate processes every day. So having a VARCHAR column to store all these CLASSPATHs can be a waste of space unless PostgreSQL somehow interns those strings (given that most of the time they will be the same).

Is there a way to deal with this at the DB level transparently to the application, without having to implement such interning (string reuse) logic at the application layer?

And, if so, is the interning support clever enough to recognize common parts (e.g. you can have millions of long VARCHAR or TEXT values that are not equal, yet only differ in a few lines)?

Upvotes: 3

Views: 1070

Answers (1)

harmic
harmic

Reputation: 30597

Postgresql does not internally intern strings.

You can break your current single table into two tables, one containing unique values for the CLASSPATH and the other containing the other columns. The second table uses a foreign key reference to refer to the values in the first. Something like this:

create table paths(
    id serial primary key, 
    classpath text unique
);

create table log(
    id serial primary key, 
    somedata text, 
    classpath_id int references test1(id)
);

Next you create a view which allows the application to see the data:

create view log_view(id, somedata, classpath) as 
    select log.id, log.somedata, paths.classpath 
    from log 
    left join paths on log.classpath_id=paths.id;

Now your application can query in the same way as it queried your original data.

Updating the tables is slightly more complex. Postgresql (9.3+) does support updatable views, but they are only automatically updatable if there is a single table in the FROM clause - and we have two. In this case you would have to write triggers or rules which replace updates on the view with corresponding updates to the tables.

There are simple examples of this in the postgresql manual. I found a more complete example here.

Upvotes: 4

Related Questions