Best practice for setup and querying versioned records in T-SQL

Question

I'm trying to optimize my SQL queries and I always come back to this one issue and I was hoping to get some insight into how I could best optimize this.

For brevity, lets say I have a simple employee table:

tbl_employees

Id     HiredDateTime
------------------
1      ...        
2      ...

That has versioned information in another another table for each employee:

tbl_emplyees_versioned

Id     Version   Name     HourlyWage
-------------------------------
1      1         Bob      10
1      2         Bob      20
1      3         Bob      30
2      1         Dan      10
2      2         Dan      20

And this is how the latest version records are retrieved in a View:

Select tbl_employees.Id, employees_LatestVersion.Name, employees_LatestVersion.HourlyWage, employees_LatestVersion.Version
From tbl_employees
Inner Join tbl_employees_versioned
 ON tbl_employees.Id = tbl_employees_versioned.Id
CROSS APPLY 
   (SELECT Id, Max(Version) AS Version
    FROM tbl_employees_versioned AS employees_LatestVersion
    WHERE Id = tbl_employees_versioned.Id
    GROUP BY Id) AS employees_LatestVersion

To get a response like this:

Id     Version   Name     HourlyWage
-------------------------------
1      3         Bob      30
2      2         Dan      20

When pulling a query that has over 500 employees records for which each have a couple few versions, this query starts choking up and takes a few seconds to run.

There are a couple strikes right off the bat, but I'm not sure how to overcome them.

Obviously the Cross Apply adds some performance loss. Is there a best practice when dealing with versioned information like this? Is there a better way to get just a record with the highest version?
The versioned table doesn't have a clustered index beause neither Id or Version are unique. Concatenated together they would be, but it doesn't work like that. Instead there is a non-clustered index for Id and another one for Version. Is there a better way to index this table to get any performance gain? Would an indexed view really help here?

Gordon Linoff · Accepted Answer

I think the best way to structure the data is using start dates and end dates. So, the data structure for your original table would look like:

create table tbl_EmployeesHistory (
    EmployeeHistoryId int,
    EffDate date not null,
    EndDate date,
    -- Fields that describe the employee during this time
)

Then, you can see the current version using a view:

create view vw_Employees as
    select *
    from tbl_EmployeesHistory
    where EndDate is NULL

In some cases, where future end dates are allowed, the where clause would be:

where coalesce(EndDate, getdate()) >= getdate()

Alternatively, in this case, you can default EndDate to some future date far, far away such as '01-o1-9999'. You would add this as the default in the create table statement, make the column not null, and then you can always use the statement:

where getdate() between EffDate and EndDate

As Martin points out in his comment, the coalesce() might impede the use of an index (it does in SQL Server), whereas this does not have that problem.

This is called a slowly changing dimension. Ralph Kimball discusses this concept in some length in his books on data warehousing.

Best practice for setup and querying versioned records in T-SQL

Answers (2)

Related Questions