Reputation: 67
TL;DR: Is this design correct and how should I query it?
Let's say we have history tables for city and address designed like this:
CREATE TABLE city_history (
id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
name VARCHAR(128) NOT NULL,
history_at DATETIME NOT NULL,
obj_id INT UNSIGNED NOT NULL
);
CREATE TABLE address_history (
id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
city_id INT NULL,
building_no VARCHAR(10) NULL,
history_at DATETIME NOT NULL,
obj_id INT UNSIGNED NOT NULL
);
Original tables are pretty much the same except for history_id and obj_id (city: id, name; address: id, city_id, building_no). There's also a foreign key relation between city and address (city_id).
History tables are populated on every change of the original entry (create, update, delete) with the exact state of the entry at given time.
obj_id holds id of original object - no foreign key, because original entry can be deleted and history entries can't. history_at is the time of creation of history entry.
History entries are created for every table independently - change in city name creates city_history entry but does not create address_history entry.
So to see what was the state of the whole address with city (e.g. on printed documents) at any T1 point in time, we take from both history tables most recent entries for given obj_id created before T1, right? With this design in theory we should be able to see the state of signle address with city at any given point of time. Could anyone help me create such a query for given address id and time? Please note that there could be multiple records with the same exact timestamp.
There is also a need to create a report for showing every change of state of given address in given time period with entries like "city_name, building_no, changed_at". Is it something that can be created with SQL query? Performance doesn't matter here so much, such reports won't be generated so often.
The above report will probably be needed in an interactive version where user can filter results e.g. by city name or building number. Is it still possible to do in SQL?
In reality address table and address_history table have 4 more foreign keys that should be joined in report (street, zip code, etc.). Wouldn't the query be like ten pages long to provide all the needed functionality?
I've tried to build some queries, play with greatest-n-per-group, but I don't think I'm getting anywhere with this. Is this design really OK for my use cases (if so, can you please provide some queries for me to play with to get where I want?)? Or should I rethink the whole design?
Any help appreciated.
Upvotes: 1
Views: 5825
Reputation: 15961
(My answer copied from here, since that question never marked an answer as accepted.)
My normal "pattern" in (very)pseudo code:
Triggers on A:
In this scenario, you'd query history with "x >= from and x < to" (not BETWEEN as the a previous record's "from" value should match the next's to "value").
Additionally, this pattern also makes "change log" reports easier.
SELECT * FROM A_history WHERE valid_from BETWEEN [reporting interval] OR valid_to BETWEEN [reporting interval]
. Note: This pattern can be implemented without triggers. Using a stored procedure, or even just multiple queries in code, can actually negate the need for the non-history table.
valid_to
.Upvotes: 3
Reputation: 20899
This is a very "traditional" Problem, when it comes down to versioning (or monitoring) of changes to a certain row.
There are various "solutions", each having its own drawback and advantage.
The following "statements" are a result of my expericence, they are neither perfect, nor do I claim they are the "only ones"!
1.) Creating a "history table": That's the worst Idea of all. You would always need to take into account which table you need to query, depending on DATA that should be queried. That's a "Chicken-Egg" Problem...
2.) Using ONE Table with ONE (increasing) "Revision" Number: That's a better approach, but it will get "hard" to query: Determining the "most recent row" per "id" is very costly no matter which aproach is used.
My personal expierence is, that following the pattern of a "double linked List" ist the best to solve this, when it comes down to Millions of records:
3.) Maintain two columns among every entity, let's say prev_version_id
and next_version_id
. prev_version_id
points to NULL
, if there is no previous version. next_version_id
points to NULL
if there is no later version.
This approach would require you to ALWAYS perform two actions upon an update:
next_version_id
) to the just insterted row. However, when your database has grown to something like 100 Million Rows, you will be very happy that you have choosen this path:
where ISNULL(prev_version_id) and entity_id = 5
where ISNULL(next_version_id) and entity_id = 5
entity_id=5
of the data-table, sortable by either prev_version_id
or next_version_id
.The very often neglected fact: The first two queries will also work to get a list of ALL first versions
or of ALL recent versions
of an entity - in about NO TIME! (Don't underestimate how "costly" it can be do determine the most recent version of an entity otherwise! Believe me, when "testing" everything seems equaly fine, but the real struggle starts when live-data with millions of records is used.)
cheers, dognose
Upvotes: 3