gloomy.penguin

Reputation: 5911

SQL query with start and end dates - what is the best option?

I am using MS SQL Server 2005 at work to build a database. I have been told that most tables will hold 1,000,000 to 500,000,000 rows of data in the near future after it is built... I have not worked with datasets this large. Most of the time I don't even know what I should be considering in order to figure out the best way to set up the schema and the queries.

So... I need to know the start and end dates for something, plus a value that is associated with an ID during that time frame. So... we can set the table up two different ways:

create table xxx_test1 (id int identity(1,1), groupid int, dt datetime, i int) 

create table xxx_test2 (id int identity(1,1), groupid int, start_dt datetime, end_dt datetime, i int) 

Which is better? How do I define "better"? I filled the first table with about 100,000 rows of data, and it takes about 10-12 seconds to query it into the format of the second table, depending on the query...

    select  y.groupid,
            y.dt as [start], 
            z.dt as [end],   
            (case when z.dt is null then 1 else 0 end) as latest, 
            y.i 
    from    #x as y 
            -- for each row, find the next row (by date) in the same group;
            -- its dt becomes this row's end date (null = no later row yet)
            outer apply (select top 1 * 
                         from   #x as x 
                         where  x.groupid = y.groupid and 
                                x.dt > y.dt 
                         order by x.dt asc) as z         

or
http://consultingblogs.emc.com/jamiethomson/archive/2005/01/10/t-sql-deriving-start-and-end-date-from-a-single-effective-date.aspx

Buuuuut... with the second table, to insert a new row I have to go look and see if there is a previous row for that group and, if so, update its end date. So... is it a question of performance when retrieving data vs. the cost of inserts/updates? It seems silly to store that end date twice, but maybe... not? What things should I be looking at?
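
For what it's worth, the insert/update dance I mean would be something like this (just a sketch against xxx_test2 above; the variable values are made up):

    declare @groupid int, @newdt datetime, @newi int
    set @groupid = 42       -- made-up example values
    set @newdt = getdate()
    set @newi = 7

    begin transaction

    -- if the group already has a current (open-ended) row, close it out
    update xxx_test2
    set    end_dt = @newdt
    where  groupid = @groupid and end_dt is null

    -- the new row becomes the current one; end_dt stays null until the next insert
    insert into xxx_test2 (groupid, start_dt, end_dt, i)
    values (@groupid, @newdt, null, @newi)

    commit transaction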

This is what I used to generate my fake data, if you want to play with it for some reason (if you change the maximum of the random number to something higher, it will generate the fake data a lot faster):

-- temp table matching the single-date design (the first create table above)
create table #x (id int identity(1,1), groupid int, dt datetime, i int)

declare @dt datetime
declare @i int
declare @id int
set @id = 1
declare @rowcount int
set @rowcount = 0
declare @numrows int 

while (@rowcount < 100000)
begin

    set @i = 1
    set @dt = getdate()
    -- random group size: ((max + 1) - min) * rand() + min, with min = 1 and max = 5
    set @numrows = Cast(((5 + 1) - 1) * Rand() + 1 As tinyint)

    -- one row per day for this group, starting tomorrow
    while @i <= @numrows
    begin
        insert into #x values (@id, dateadd(d, @i, @dt), @i)
        set @i = @i + 1
    end 

    set @rowcount = @rowcount + @numrows
    set @id = @id + 1
    print @rowcount
end 

Upvotes: 0

Views: 2071

Answers (2)

Aprillion

Reputation: 22324

For anyone who can use the LEAD analytic function of SQL Server 2012 (or Oracle, DB2, ...), retrieving data from the 1st table (the one with only 1 date column) would be much, much quicker than without this feature:

select
  groupid,
  dt "start",
  -- lead(dt) is the next row's dt within the same group
  lead(dt) over (partition by groupid order by dt) "end",
  case when lead(dt) over (partition by groupid order by dt) is null
       then 1 else 0 end "latest",
  i
from #x

Upvotes: 0

Tadish Durbin

Reputation: 44

For your purposes, I think option 2 is the way to go for table design. This gives you flexibility, and will save you tons of work.

Having both the start date and the end date allows a query to return only the currently effective data by putting this in your where clause:

where getdate() between start_dt and end_dt

(If you leave end_dt null on the latest row rather than storing a far-future date, you would also need "or end_dt is null".)

You can also then use it to join with other tables in a time-sensitive way.
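
For example, a time-sensitive join might look something like this (a sketch; the orders table and its columns are hypothetical, and a null end_dt is treated as "still current"):

    select  o.order_id, x.i
    from    orders as o
            join xxx_test2 as x
              on  x.groupid = o.groupid
              and o.order_date >= x.start_dt
              and (o.order_date < x.end_dt or x.end_dt is null)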

Provided you set up the key properly and provide the right indexes, performance (on this table at least) should not be a problem.
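
To be concrete, something along these lines is what I have in mind (a sketch, not a tuned recommendation for your data):

    -- clustered key on the surrogate id, plus a covering index for the
    -- group/date lookups used by the queries above
    alter table xxx_test2
        add constraint pk_xxx_test2 primary key clustered (id)

    create nonclustered index ix_xxx_test2_group_dates
        on xxx_test2 (groupid, start_dt, end_dt)
        include (i)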

Upvotes: 3
