Reputation: 9191
How would you design a database to support a feature that allows an application user to create a snapshot of their data at a point in time, a bit like version control?
It would give the user the ability to go back and see what their data looked like in the past.
Assume that the data being "snapshotted" is complex and includes joins of multiple tables.
I'm looking for a way to give each application user the ability to snapshot their data and go back to it. Whole-database snapshots are not what I'm looking for.
EDIT: Thanks for your answers. The 6NF answer is compelling as is the suggestion to de-normalise the snapshot data due to its simplicity.
Clarification: this is not a data warehousing question, nor is it a question about DB backup and restore; it's about how to build a schema that allows us to capture the state of a specific set of related data at a point in time. The snapshots are generated by the application users when they see fit. Users do not snapshot the entire DB, just the data object they are interested in.
Upvotes: 27
Views: 21080
Reputation: 2204
In modern database products I think this is implemented by default. If you are using Databricks Delta tables, they support versioning out of the box; check Delta time travel. It essentially implements the database snapshot pattern, although the internals of the implementation are abstracted away by Delta tables and the user gets it for free.
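For illustration, time travel is exposed directly in Delta SQL; a quick sketch (the orders table name is made up):

-- Read the table as of an earlier version number (Databricks Delta SQL)
SELECT * FROM orders VERSION AS OF 12;

-- Or as of a wall-clock timestamp
SELECT * FROM orders TIMESTAMP AS OF '2023-06-01';

-- List the versions available for time travel
DESCRIBE HISTORY orders;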
Upvotes: 0
Reputation: 181
I would add a timestamp field to each table, whether it is a master table or a fact/transaction table, and make every table insert-only: a new row is inserted even when the goal is to update.
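A minimal sketch of that insert-only pattern, with a made-up customer table:

-- Every change is a new row; valid_from orders the versions.
CREATE TABLE customer (
    customer_id INT NOT NULL,
    name        VARCHAR(100),
    valid_from  DATETIME NOT NULL,
    PRIMARY KEY (customer_id, valid_from)
);

-- An "update" inserts a new version; existing rows are never modified.
INSERT INTO customer (customer_id, name, valid_from)
VALUES (42, 'New Name', CURRENT_TIMESTAMP);

-- State at a point in time: the latest version at or before the snapshot.
SELECT c.*
FROM customer c
WHERE c.valid_from = (SELECT MAX(c2.valid_from)
                      FROM customer c2
                      WHERE c2.customer_id = c.customer_id
                        AND c2.valid_from <= '2020-01-01');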
Upvotes: 0
Reputation: 4733
Maybe consider using a NoSQL solution like MongoDB to aggregate all of your relational data into a single document, then store that document with a timestamp or version number. Solutions like Kafka Connect or Oracle GoldenGate simplify piping relational data into NoSQL stores.
Upvotes: 0
Reputation: 6297
Use Log Triggers
All data changes are captured, giving you the ability to query the data as it appeared at any point in time.
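As a sketch of the pattern (SQL Server-style trigger syntax; the account table and its history table are made up), a log trigger closes the old version's validity interval and opens a new one on every update:

CREATE TRIGGER account_log_update ON account
AFTER UPDATE
AS
BEGIN
    -- Close the interval on the previous version of each changed row...
    UPDATE account_history
    SET valid_to = GETDATE()
    WHERE account_id IN (SELECT account_id FROM deleted)
      AND valid_to IS NULL;

    -- ...and open a new interval for the new version.
    INSERT INTO account_history (account_id, balance, valid_from, valid_to)
    SELECT account_id, balance, GETDATE(), NULL
    FROM inserted;
END;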
Upvotes: 1
Reputation: 18894
You can use the logs produced by your RDBMS to obtain snapshots of your data. Normally the logs are used to provide database recovery. They can however also be used to replicate the data across several RDBMS instances or to get snapshots of the data.
To get a snapshot of your data, simply take into account all the logs produced before the desired moment in time. You then "play back" those logs to obtain the actual database with your data restored.
How to access and "play back" the logs depends on the concrete RDBMS product you use.
Another possibility is to use temporal databases. They have time-aspects built in and allow "going-back-in-time". Look for "Oracle Flashback Technology" for example. http://www.stanford.edu/dept/itss/docs/oracle/10g/appdev.101/b10795/adfns_fl.htm#ADFNS1008
Upvotes: 0
Reputation: 29892
Having snapshots and/or an audit trail is a common database requirement. For many applications, creating 'shadow' or audit tables is an easy and straightforward task. While database-level backups and transaction logs are good to have, they are not a version control system.
Basically, you need to create a shadow table with all the same columns as the base table, and then set up triggers on the base table to place a copy of the row in the shadow table whenever it is updated or deleted.
Through some logic you can recreate what the data looked like at a given point in time. For an easy way to set this up in Sybase see: http://www.theeggeadventure.com/wikimedia/index.php/Sybase_Tips#create_.27audit.27_columns
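A minimal sketch of the shadow table plus trigger, in SQL Server-style syntax (the employee table is made up):

-- Shadow table: same columns as the base table, plus audit metadata.
CREATE TABLE employee_shadow (
    id           INT,
    name         VARCHAR(100),
    salary       DECIMAL(10,2),
    audit_action CHAR(1),       -- 'U' = update, 'D' = delete
    audit_date   DATETIME
);

-- Copy the old version of each row on every update or delete.
CREATE TRIGGER employee_audit ON employee
AFTER UPDATE, DELETE
AS
BEGIN
    INSERT INTO employee_shadow (id, name, salary, audit_action, audit_date)
    SELECT id, name, salary,
           CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'U' ELSE 'D' END,
           GETDATE()
    FROM deleted;
END;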
If you need to do lots of historical snapshots, then you can keep the history in the same table. Basically, create two columns: an 'added' and a 'deleted' column. The downside is that every query must add a WHERE clause. Of course you can create a view which shows just the active records. This gets a bit more complicated if you have a normalized database with multiple tables, all with history.
However, it does work. You simply have the 'added' and 'deleted' columns on each table, and your query supplies the point in time of interest. Whenever data is modified you must copy the current row and mark it as deleted.
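The point-in-time query then looks something like this (table and column names are illustrative):

-- Rows that existed at @point_in_time: added on or before it,
-- and not deleted until after it (or not deleted at all).
SELECT *
FROM orders
WHERE added <= @point_in_time
  AND (deleted IS NULL OR deleted > @point_in_time);

-- A view hides the history from everyday queries.
CREATE VIEW active_orders AS
    SELECT * FROM orders WHERE deleted IS NULL;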
Upvotes: 3
Reputation: 1007
With SQL Server at least, you can use the full recovery model and keep the transaction log backups between each backup set.
Then you can do a point-in-time restore.
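For the record, a sketch of what that looks like (made-up file names; STOPAT picks the moment to restore to):

-- Restore the last full backup without recovering yet...
RESTORE DATABASE MyDB FROM DISK = 'C:\Backups\MyDB_full.bak'
WITH NORECOVERY;

-- ...then roll the log forward to the chosen point in time.
RESTORE LOG MyDB FROM DISK = 'C:\Backups\MyDB_log.trn'
WITH STOPAT = '2020-01-01 12:00:00', RECOVERY;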
That's a poor solution.
What exactly does your client want? Is it for analytical purposes (i.e. questions like "how many orders did we have two weeks ago")? Because that's exactly the problem a data warehouse solves.
Upvotes: 0
Reputation: 35171
This is NOT easy.
You're essentially asking for a Temporal Database (What Christopher Date calls Sixth Normal Form, or 6NF).
To be 6NF, a schema must also be 5NF, and, basically, for each datum, you need to attach a time range for which the datum at that value is applicable. Then in joins, the join must include only the rows that are within the time range being considered.
Temporal modeling is hard -- it's what 6th Normal Form addresses -- and not well supported in current RDBMSes.
The problem is the granularity. 6th Normal Form (as I understand it) supports temporal modeling by making every non-key (non-key, i.e., anything "on" the entity that can change without the entity losing its identity) a separate relation. To this, you add a timestamp or time range or version number. Making everything a join solves the granularity problem, but it also means your queries are going to be more complicated and slower. It also requires figuring out all keys and non-key attributes; this tends to be a large effort.
Basically, everywhere you have a relation ("ted owns the GM stock certificate with id 789") you add a time: "ted owns the GM stock certificate with id 789 now" so that you can simultaneously say, "fred owns the GM stock certificate with id 789 from 3 Feb 2000 to yesterday". Obviously these relations are many-to-many (ted can own more than one certificate now, and more than one over his lifetime, too, and fred can have previously owned the certificate jack owns now).
So we have a table of owners, and a table of stock certificates, and a many-to-many table that relates owners and certificates by id. To the many-to-many table, we add a start_date and an end_date.
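In SQL, that many-to-many table with its time range might look like this (names are illustrative):

CREATE TABLE person (
    person_id INT PRIMARY KEY,
    name      VARCHAR(100)
);

CREATE TABLE stock_certificate (
    certificate_id INT PRIMARY KEY,
    ticker         VARCHAR(10)
);

-- Ownership is a relationship over time: the start date is part of the key.
CREATE TABLE person_stockcertificate (
    person_id      INT NOT NULL REFERENCES person(person_id),
    certificate_id INT NOT NULL REFERENCES stock_certificate(certificate_id),
    start_date     DATE NOT NULL,
    end_date       DATE,           -- NULL = current owner
    PRIMARY KEY (person_id, certificate_id, start_date)
);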
Now, imagine that each state/province/land taxes the dividends on stock certificates, so for tax purposes we need to record the stock certificate's owner's state of residency.
Where the owner resides can obviously change independently of stock ownership; ted can live in Nebraska, buy 10 shares, get a dividend that Nebraska taxes, move to Nevada, sell 5 shares to fred, buy 10 more shares.
But for us, it's: ted can move to Nebraska at some time, buy 10 shares at some time, get a dividend at some time, which Nebraska taxes, move to Nevada at some time, sell 5 shares to fred at some time, buy 10 more shares at some time.
We need all of that if we want to calculate what taxes ted owes in Nebraska and in Nevada, joining up on the matching/overlapping date ranges in person_stockcertificate and person_address. A person's address is no longer one-to-one; it's one-to-many, because it's an address during a time range.
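A sketch of that join on overlapping ranges, assuming a person_address table shaped like person_stockcertificate and a made-up dividend table:

-- A dividend is taxed by the state where the owner lived on the pay date:
-- both the ownership range and the residency range must cover pay_date.
SELECT d.pay_date, d.amount, a.state
FROM dividend d
JOIN person_stockcertificate ps
  ON ps.certificate_id = d.certificate_id
 AND d.pay_date >= ps.start_date
 AND (ps.end_date IS NULL OR d.pay_date < ps.end_date)
JOIN person_address a
  ON a.person_id = ps.person_id
 AND d.pay_date >= a.start_date
 AND (a.end_date IS NULL OR d.pay_date < a.end_date)
WHERE ps.person_id = @person_id;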
If ted buys ten shares, do we model a buy event with a single purchase date, or do we add a date_bought to each share? Depends on the question we need the model to answer.
Upvotes: 19
Reputation: 12404
We did this once by creating separate database tables that contained the data we wanted to snapshot, but denormalized, i.e. every record contained all the data required to make sense, not references to ids that may or may not still exist. It also added a date to each row.
Then we wrote triggers for specific inserts or updates that did a join on all affected tables and inserted the result into the snapshot tables.
This way it would be trivial to write something that restored the users' data to a point in time.
If you have a table:
user:
id, firstname, lastname, department_id
department:
id, name, departmenthead_id
your snapshot of the user table could look like this:
user_id, user_firstname, user_lastname, department_id, department_name, departmenthead_id, departmenthead_firstname, departmenthead_lastname, snapshot_date
and a query something like
INSERT INTO usersnapshot
SELECT user.id AS user_id, user.firstname AS user_firstname, user.lastname AS user_lastname,
department.id AS department_id, department.name AS department_name,
departmenthead.id AS departmenthead_id, departmenthead.firstname AS departmenthead_firstname, departmenthead.lastname AS departmenthead_lastname,
GETDATE() AS snapshot_date
FROM user
INNER JOIN department ON user.department_id = department.id
INNER JOIN user departmenthead ON department.departmenthead_id = departmenthead.id
This ensures each row in the snapshot is true for that moment in time, even if department or department head has changed in the meantime.
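Restoring a user's view of the data then reduces to reading the newest snapshot at or before the requested time:

-- Fetch the most recent snapshot at or before the requested point in time.
SELECT *
FROM usersnapshot
WHERE user_id = @user_id
  AND snapshot_date = (SELECT MAX(snapshot_date)
                       FROM usersnapshot
                       WHERE user_id = @user_id
                         AND snapshot_date <= @point_in_time);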
Upvotes: 13
Reputation: 13582
Oracle has had Flashback technology since version 9i, and it is much improved in Oracle 10g and 11g; you can see the state of your database at any given point in history, provided you enable Flashback.
Check this document: Flashback Overview
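For example, a Flashback Query reads a table as it was at an earlier time (the orders table name is made up):

-- Table contents as of one hour ago
SELECT *
FROM orders AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR);

-- Or as of a specific system change number (SCN)
SELECT * FROM orders AS OF SCN 123456;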
Upvotes: 0
Reputation: 300769
SQL Server 2005 (onwards) Enterprise Edition has the ability to create database snapshots.
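For reference, a sketch of the syntax (logical and physical file names are made up); note this snapshots the whole database, which the question explicitly rules out:

-- A database snapshot: a read-only, point-in-time copy of the source DB.
CREATE DATABASE SalesDB_Snapshot
ON (NAME = SalesDB_Data,
    FILENAME = 'C:\Snapshots\SalesDB_Snapshot.ss')
AS SNAPSHOT OF SalesDB;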
Upvotes: -1