Ramfjord

Reputation: 939

Is there a disadvantage to having large columns in your database?

My database stores user stats on a variety of questions. There is no table of question types; instead of using a join table on question types, I've stored the user stats for each type of question the user has done as a serialized hash-map in the user table. Obviously this has led to some decently sized user rows: the serialized stats for my own user are around 950 characters, and I can imagine them easily growing to 5 kB for power users.

I have never seen an example of a column this large in any book. Will performance be greatly hindered by having such large, variable-length columns in my table? Should I add a table for question types and move the user stats into a separate table as well?

I am currently using PostgreSQL, if that's relevant.
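For concreteness, the current shape is roughly this (the names and the serialization format are illustrative):

    CREATE TABLE users (
        id    serial PRIMARY KEY,
        name  text,
        -- serialized hash-map of per-question-type stats, ~1-5 kB per user
        stats text
    );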

Upvotes: 1

Views: 459

Answers (4)

Branko Dimitrijevic

Reputation: 52117

It depends on the queries that predominate:

  • If you need queries that select all (or most) of the columns, then this is the optimal design.
  • If, however, you mostly select a subset of the columns, then it might be worth trying to "vertically partition"1 the table, so you avoid I/O for the "unneeded" columns and increase cache efficiency.2

Of course, all this is under the assumption that the serialized data behaves as a "black box" from the database's perspective. If you need to search or constrain that data in some fashion, then storing it as a dummy byte array would violate the principle of atomicity, and therefore 1NF, so you'd need to consider normalizing your data...
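A minimal normalized sketch, assuming one stats row per user per question type (all names are illustrative):

    CREATE TABLE question_types (
        id   serial PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    CREATE TABLE user_question_stats (
        user_id          integer REFERENCES users(id),
        question_type_id integer REFERENCES question_types(id),
        attempts         integer NOT NULL DEFAULT 0,
        correct          integer NOT NULL DEFAULT 0,
        PRIMARY KEY (user_id, question_type_id)
    );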


1 I.e. move the rarely used columns to a second table, which is in a 1:1 relationship with the original table. If you are using BLOBs, a similar effect can be achieved by declaring what portion of the BLOB should be kept "in-line": the remainder of any BLOB that exceeds that limit is stored in a set of pages separate from the table's "core" pages.
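A sketch of both ideas in PostgreSQL terms (names are illustrative; PostgreSQL moves large values out-of-line automatically via TOAST, but the per-column strategy can be tuned):

    -- Vertical partitioning: move the wide column into a 1:1 side table.
    CREATE TABLE user_stats_blob (
        user_id integer PRIMARY KEY REFERENCES users(id),
        stats   text
    );

    -- EXTERNAL stores large values out-of-line (in the TOAST table)
    -- without compression; EXTENDED (the default) compresses first.
    ALTER TABLE user_stats_blob ALTER COLUMN stats SET STORAGE EXTERNAL;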

2 DBMSes typically implement caching at the page level, so the wider the rows, the fewer of them fit into a single page on disk, and therefore into a single page in cache.
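To see how wide the rows actually are, PostgreSQL's pg_column_size can be run against whole rows or a single column (again, the names are illustrative):

    SELECT avg(pg_column_size(u.*))     AS avg_row_bytes,
           avg(pg_column_size(u.stats)) AS avg_stats_bytes
    FROM users u;

    -- Pages the table occupies (8 kB each by default):
    SELECT relpages FROM pg_class WHERE relname = 'users';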

Upvotes: 2

Chris Travers

Reputation: 26464

The big disadvantage has to do with what happens with a select *. If you use a specific field list, you are not likely to have a big problem, but with select * over a lot of TOASTed columns you incur a lot of extra random disk I/O unless everything fits in memory. Selecting fewer columns makes things better.
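To illustrate (the column names are hypothetical):

    -- Detoasts the wide "stats" value for every row returned:
    SELECT * FROM users WHERE last_seen > now() - interval '1 day';

    -- Never touches the TOAST pages, so the wide column costs nothing here:
    SELECT id, name FROM users WHERE last_seen > now() - interval '1 day';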

In an object-relational database like PostgreSQL, normalization poses different tradeoffs than in a purely relational model. In general it is still a good thing (as I say, push the relational model as far as it can comfortably go before doing object-relational things in your db), but it isn't the absolute necessity that you might think it is in a purely relational db. Additionally, you can add functions that process that data with regexps, extract elements from JSON, etc., and pull those results back into your relational queries. So for data that cannot comfortably be normalized, big amorphous "docdb" fields are not that big of a problem.
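For example, if the serialized blob happens to be JSON, it can still feed relational queries (this assumes the PostgreSQL 9.3+ JSON operators and a made-up key name):

    SELECT id, (stats::json ->> 'algebra')::int AS algebra_correct
    FROM users
    WHERE (stats::json ->> 'algebra')::int > 100;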

Upvotes: 2

Nathan

Reputation: 2775

I've seen this serialized approach in systems like ProcessMaker, a web workflow and BPM application that stores its data in a serialized fashion. It performs quite well, but building reports based on that data is really tricky.

You can (and should) normalize your database; that works well as long as your information model doesn't change too often.

Otherwise, you may want to try non-relational databases like RavenDB, MongoDB, etc.
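If you do normalize, report queries become plain joins; for example (assuming illustrative tables question_types(id, name) and user_question_stats(user_id, question_type_id, attempts, correct)):

    SELECT qt.name,
           sum(s.correct)::float / nullif(sum(s.attempts), 0) AS accuracy
    FROM user_question_stats s
    JOIN question_types qt ON qt.id = s.question_type_id
    GROUP BY qt.name;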

Upvotes: 3

Cybercartel

Reputation: 12592

You can't search inside serialized arrays.
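About the best you can do against an opaque serialized column is a substring match, which is slow (no useful index) and unreliable (the key name here is hypothetical):

    SELECT * FROM users WHERE stats LIKE '%algebra%';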

Upvotes: 1
