Thomas Tempelmann

Reputation: 12043

SQL schema pattern for keeping history of changes

Consider a database that maintains a list of persons and their contact information, including addresses and such.

Sometimes, the contact information changes. Instead of simply updating the single person record to the new values, I like to keep a history of the changes.

I like to keep the history in a way that, when I look at a person's record, I can quickly determine that there are older versions of that person's data as well. However, I also like to avoid having to build very complicated SQL queries for retrieving only the latest version of each person's records (while this may be easy with a single table, it quickly gets difficult once the table is connected to other tables).

I've come up with a few ways, which I'll add below as answers, but I wonder if there are better ways. (While I'm a seasoned code writer, I'm rather new to DB design, so I lack the experience and have already run into a few dead ends.)

Which DB? I am currently using sqlite but plan to move to a server-based DB engine eventually, probably Postgres. However, I mean this question in a more general form, not specific to any particular engine, though suggestions on how to solve this in certain engines are appreciated, too, in the general interest.

Upvotes: 6

Views: 7674

Answers (6)

Mike Nakis

Reputation: 61969

Must you keep structured history information?

  • Quite often, the history of changes does not have to be structured, because the history is needed for auditing purposes only, and there is no actual need to be able to perform queries against the historical data.

    • So, what quite often suffices is to simply log each modification that is made to the database, for which you only need a log table with a date-time field and some variable length text field into which you can format human-readable messages as to who changed what, and what the old value was, and what the new value is.
    • Nothing needs to be added to the actual data tables, and no additional complexity needs to be added to the queries.
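To illustrate the unstructured-log idea, here is a minimal sketch using Python's built-in sqlite3 module. The table names (person, change_log), the column names and the update_address helper are all made up for this example; the point is only that the data table stays untouched and the log is just a timestamp plus free text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    -- Hypothetical log table: a timestamp plus a free-form message.
    CREATE TABLE change_log (
        logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        message   TEXT NOT NULL
    );
""")

def update_address(conn, person_id, new_address, changed_by):
    """Update the address and write a human-readable log entry in one transaction."""
    with conn:  # commits on success, rolls back on an exception
        (old_address,) = conn.execute(
            "SELECT address FROM person WHERE id = ?", (person_id,)
        ).fetchone()
        conn.execute(
            "UPDATE person SET address = ? WHERE id = ?", (new_address, person_id)
        )
        message = (
            f"{changed_by} changed person {person_id} address: "
            f"{old_address!r} -> {new_address!r}"
        )
        conn.execute("INSERT INTO change_log (message) VALUES (?)", (message,))

conn.execute("INSERT INTO person (id, name, address) VALUES (1, 'Alice', '1 Old Road')")
update_address(conn, 1, "2 New Street", "admin")

# The person table holds only current data; the log holds the audit trail.
log_messages = [row[0] for row in conn.execute("SELECT message FROM change_log")]
current_address = conn.execute("SELECT address FROM person WHERE id = 1").fetchone()[0]
```

Queries against person stay exactly as they were; the trade-off is that the log text cannot be queried in any structured way.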

If you must keep structured history information:

  • If you need to be able to execute queries against historical data, then you must keep the historical data in the database. Some people recommend separate historical tables; I consider this misguided. Instead, I recommend using views.

    • Rename each table from "NAME" to "NAME_HISTORY" and then create a view called "NAME" which presents to you only the latest records.
    • Views are a feature which exists in most RDBMSes. A view looks like a table, so you can query it as if it was a table, but it is read-only, and it can be created by simply defining a query on existing tables (and views.)
    • So, with a query which, for each entity, picks only the row with the most recent history-date and selects all fields except history-date, you can create a view that looks exactly like the original table before historicity was added.
    • Any existing code which just performs queries and does not need to be aware of history will continue working as before.
    • Code that performs queries against historical data, and code that modifies tables, will now need to start using "NAME_HISTORY" instead of "NAME".
    • It is okay if code which modifies the table is burdened by having to refer to the table as "NAME_HISTORY" instead of "NAME", because that code will also have to take into account the fact that it is not just updating the table, it is appending new historical records to it.
    • As a matter of fact, since views are read-only, the use of views will prevent you from accidentally modifying a table without taking care of historicity, and that's a good thing.
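A rough sketch of the rename-plus-view idea, again with Python's sqlite3 module. The person_history table, its columns and the sample rows are assumptions for illustration. Note that the view below picks the latest row with a correlated MAX(history_date) subquery, which is just one common way to write "latest row per person", not necessarily the exact query the answer has in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The renamed table: every version of every person lives here.
    CREATE TABLE person_history (
        person_id    INTEGER NOT NULL,
        name         TEXT,
        address      TEXT,
        history_date TEXT NOT NULL,   -- ISO 8601 text, so it sorts chronologically
        PRIMARY KEY (person_id, history_date)
    );

    -- The view takes over the original table name and hides history_date.
    CREATE VIEW person AS
    SELECT h.person_id, h.name, h.address
    FROM person_history AS h
    WHERE h.history_date = (
        SELECT MAX(h2.history_date)
        FROM person_history AS h2
        WHERE h2.person_id = h.person_id
    );
""")

# Writers append to person_history; they never update rows in place.
conn.executemany(
    "INSERT INTO person_history VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "1 Old Road", "2014-01-01"),
        (1, "Alice", "2 New Street", "2015-01-01"),
        (2, "Bob", "5 High Street", "2014-06-01"),
    ],
)

# Readers that do not care about history keep querying "person" as before.
current_rows = conn.execute(
    "SELECT person_id, name, address FROM person ORDER BY person_id"
).fetchall()
```

Code that needs the older versions queries person_history directly and filters or orders by history_date.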

Upvotes: 10

izzy

Reputation: 807

We use a history integer column. New rows are always inserted with a history of 0, and any previous rows for that entry have the history incremented by 1.
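Something along these lines, sketched with Python's sqlite3 module. The table layout and the save_version helper are made up for the example. The sketch deliberately avoids a unique constraint on (person_id, history), because the increment touches several rows in one statement and could collide with such a constraint partway through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER NOT NULL,
        address   TEXT,
        history   INTEGER NOT NULL DEFAULT 0   -- 0 = current, higher = older
    );
    CREATE INDEX person_by_id ON person (person_id, history);
""")

def save_version(conn, person_id, address):
    """Age all existing rows for this person by one, then insert the new row as 0."""
    with conn:  # both statements succeed or fail together
        conn.execute(
            "UPDATE person SET history = history + 1 WHERE person_id = ?",
            (person_id,),
        )
        conn.execute(
            "INSERT INTO person (person_id, address, history) VALUES (?, ?, 0)",
            (person_id, address),
        )

save_version(conn, 1, "1 Old Road")
save_version(conn, 1, "2 New Street")
save_version(conn, 1, "3 Third Avenue")

# Current data: a fixed, simple condition.
current = conn.execute(
    "SELECT address FROM person WHERE person_id = 1 AND history = 0"
).fetchall()

# Full history, newest first.
versions = conn.execute(
    "SELECT history, address FROM person WHERE person_id = 1 ORDER BY history"
).fetchall()
```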

Depending on how often the historical data is to be used, it might be wise to store history rows in a separate table. A simple view could be used if the combined data is desired, and it should speed things up if you usually just need the current rows.

Upvotes: 0

Geert Bellekens

Reputation: 13701

We use what we call the Verity-Block pattern.


The verity contains the periodicity, the block contains immutable data.

In the case of personal data we have the Identity verity that has a validity period, and the IdentificationBlock that contains the data such as Name, LastName, BirthDate

Blocks are immutable, so whenever we change something the application makes sure to create a new block.

So if your last name changes on 01/01/2015 from Smits to Johnson, then we have an Identity verity valid from [mindate] to 31/12/2014 that is linked to an IdentificationBlock where LastName = Smits, and an Identity valid from 01/01/2015 to [maxdate] that is linked to an IdentificationBlock where LastName = Johnson.

So in the database we have tables:

Identification
  ID_Identification [PK]

Identity
  ID_Identity [PK]
  ID_Identification [FK]
  ID_IdentificationBlock [FK]
  ValidFrom
  ValidTo

IdentificationBlock
  ID_IdentificationBlock [PK]
  ID_Identification [FK]
  FirstName
  LastName
  BirthDate

A typical query to get the current name would be

Select idb.FirstName, idb.LastName from IdentificationBlock idb
join Identity i on idb.ID_IdentificationBlock = i.ID_IdentificationBlock
where getDate() between i.ValidFrom and i.ValidTo
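For anyone who wants to try the pattern out, here is a small self-contained sketch of the three tables using Python's sqlite3 module. The sample person, the ISO date strings and the sentinel values '0001-01-01' and '9999-12-31' standing in for [mindate] and [maxdate] are assumptions, as is the name_on helper. It takes the date as a parameter instead of calling getDate(), which is SQL Server specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Identification (ID_Identification INTEGER PRIMARY KEY);

    -- Block: immutable data, one row per version of the person's details.
    CREATE TABLE IdentificationBlock (
        ID_IdentificationBlock INTEGER PRIMARY KEY,
        ID_Identification INTEGER NOT NULL
            REFERENCES Identification (ID_Identification),
        FirstName TEXT, LastName TEXT, BirthDate TEXT
    );

    -- Verity: says during which period a given block is the valid one.
    CREATE TABLE Identity (
        ID_Identity INTEGER PRIMARY KEY,
        ID_Identification INTEGER NOT NULL
            REFERENCES Identification (ID_Identification),
        ID_IdentificationBlock INTEGER NOT NULL
            REFERENCES IdentificationBlock (ID_IdentificationBlock),
        ValidFrom TEXT NOT NULL,
        ValidTo   TEXT NOT NULL
    );

    -- Sample data (made up): the last name changes on 2015-01-01.
    INSERT INTO Identification VALUES (1);
    INSERT INTO IdentificationBlock VALUES (10, 1, 'Anna', 'Smits',   '1980-05-17');
    INSERT INTO IdentificationBlock VALUES (11, 1, 'Anna', 'Johnson', '1980-05-17');
    INSERT INTO Identity VALUES (100, 1, 10, '0001-01-01', '2014-12-31');
    INSERT INTO Identity VALUES (101, 1, 11, '2015-01-01', '9999-12-31');
""")

def name_on(conn, identification_id, as_of):
    """Return (FirstName, LastName) valid for this person on the given ISO date."""
    return conn.execute(
        """
        SELECT idb.FirstName, idb.LastName
        FROM Identity AS i
        JOIN IdentificationBlock AS idb
          ON idb.ID_IdentificationBlock = i.ID_IdentificationBlock
        WHERE i.ID_Identification = ?
          AND ? BETWEEN i.ValidFrom AND i.ValidTo
        """,
        (identification_id, as_of),
    ).fetchone()

name_before = name_on(conn, 1, "2014-06-30")
name_after = name_on(conn, 1, "2015-06-30")
```

Passing today's date as as_of gives the current name; passing any other date gives the name that was valid then.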

Upvotes: 1

Anton Gogolev

Reputation: 115691

This is generally referred to as a Slowly Changing Dimension, and the linked Wikipedia page offers several approaches to make it work.

Martin Fowler has a list of Temporal Patterns that are not exactly DB-specific, but offer a good starting point.

And finally, Microsoft SQL Server offers Change Data Capture and Change Tracking.

Upvotes: 11

Thomas Tempelmann

Reputation: 12043

Move older versions into a separate "history" table.

By using SQL triggers the old data is automatically written to the "history" table.
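A minimal sketch of what such a trigger could look like in sqlite, driven from Python's sqlite3 module. The table, column and trigger names are made up for the example, and it only covers UPDATE; a similar trigger would be needed if deleted rows should be kept as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, address TEXT);

    -- Same columns as person, plus the time the version was replaced.
    CREATE TABLE person_history (
        person_id   INTEGER NOT NULL,
        name        TEXT,
        address     TEXT,
        replaced_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Copy the old values into the history table whenever a row is updated.
    CREATE TRIGGER person_keep_history
    AFTER UPDATE ON person
    FOR EACH ROW
    BEGIN
        INSERT INTO person_history (person_id, name, address)
        VALUES (OLD.id, OLD.name, OLD.address);
    END;
""")

with conn:
    conn.execute("INSERT INTO person (id, name, address) VALUES (1, 'Alice', '1 Old Road')")
    # A plain UPDATE; the calling code knows nothing about the history table.
    conn.execute("UPDATE person SET address = '2 New Street' WHERE id = 1")

current = conn.execute("SELECT id, name, address FROM person").fetchall()
history = conn.execute("SELECT person_id, name, address FROM person_history").fetchall()
```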

Pros:

Queries that ask for only the latest data remain simple.

By using triggers, updating data doesn't need to be concerned with maintaining the history.

Maintaining an FTS (Full Text Search) table in sqlite only for the most recent versions of data is easy because the triggers would be attached only to the "current" (non-history) table, thereby avoiding the storage of obsolete data.

Cons:

Detection of history entries requires querying a separate table (that's not a big issue, though). This may also be alleviated by adding a backlink column as a foreign key to the history table.

Every table that shall maintain a history needs a duplicate table for the history. Makes writing the schema tedious unless program code is written to create such "history" tables dynamically.

Upvotes: 0

Thomas Tempelmann

Reputation: 12043

Add an "active" flag or add a "version" number.

  • Using a flag requires adding a condition such as active=1 to every query's WHERE clause involving the table.

  • Using a version number requires adding a subquery such as:

    version = (SELECT MAX(version) FROM MyTable t2 WHERE MyTable.id = t2.id)
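Here is a small runnable sketch of the version-number variant with Python's sqlite3 module. MyTable, id and version come from the subquery above; the address column, the sample data and the add_version helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyTable (
        id      INTEGER NOT NULL,   -- identifies the person across versions
        version INTEGER NOT NULL,   -- 1, 2, 3, ... per person
        address TEXT,
        PRIMARY KEY (id, version)
    )
""")

def add_version(conn, row_id, address):
    """Insert a new row whose version is one higher than the current maximum."""
    with conn:
        conn.execute(
            "INSERT INTO MyTable (id, version, address) "
            "SELECT ?, COALESCE(MAX(version), 0) + 1, ? FROM MyTable WHERE id = ?",
            (row_id, address, row_id),
        )

add_version(conn, 1, "1 Old Road")
add_version(conn, 1, "2 New Street")
add_version(conn, 2, "5 High Street")

# Latest version of each person: the extra condition from above.
latest = conn.execute("""
    SELECT id, version, address
    FROM MyTable
    WHERE version = (SELECT MAX(version) FROM MyTable t2 WHERE MyTable.id = t2.id)
    ORDER BY id
""").fetchall()

# Dropping the condition returns every version, i.e. the full history.
everything = conn.execute(
    "SELECT id, version, address FROM MyTable ORDER BY id, version"
).fetchall()
```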

Pros:

Keeps the database design simple.

Detection of history entries is easy - just remove the extra condition from the queries.

Cons:

Updating data requires setting the active or version values accordingly. (Though this might be handled with SQL triggers, I guess.)

Complicates queries. While this may not affect the performance, it's getting more difficult to write and maintain such queries by hand the more complex the queries get, especially when involving joined queries.

Foreign keys into this table cannot use the rowid to refer to a person because updates to the person create a new entry in the table, thereby effectively changing the rowid of the latest data for the person.

Maintaining an FTS (Full Text Search) table in sqlite only for the most recent versions of data is slightly more difficult, because the triggers for automatic updates to the FTS need to take the active or version values into account to make sure that only the latest data is stored and outdated data gets removed.

Upvotes: 0
