derivation
derivation

Reputation: 4612

Best way to get distinct values from large table

I have a db table with about 10 or so columns, two of which are month and year. The table has about 250k rows now, and we expect it to grow by about 100-150k records a month. A lot of queries involve the month and year column (ex, all records from march 2010), and so we frequently need to get the available month and year combinations (ie do we have records for april 2010?).

A coworker thinks that we should have a separate table from our main one that only contains the months and years we have data for. We only add records to our main table once a month, so it would just be a small update on the end of our scripts to add the new entry to this second table. This second table would be queried whenever we need to find the available month/year entries on the first table. This solution feels kludgy to me and a violation of DRY.

What do you think is the correct way of solving this problem? Is there a better way than having two tables?

Upvotes: 7

Views: 26457

Answers (5)

David Sopko
David Sopko

Reputation: 5643

Use a 'Materialized View', also called an 'Indexed View with Schema Binding', and then index this view. When you do this SQL server will essentially create and maintain the data in a secondary table behind the scenes and choose to use the index on this table when appropriate.

This is similar to what your co-worker suggested, the advantage being you won't need to add logic to your query to take advantage of it, SQL Server will do this when it creates a query plan and SQL Server will also automatically maintain the data in the Indexed View.

Here is how you would accomplish this: create a view that returns the distinct [month] [year] values and then index [year] [month] on the view. Again SQL Server will use the tiny index on the view and avoid the table scan on the big table.

Because SQL server will not let you index a view with the DISTINCT keyword, instead use GROUP BY [year],[month] and use BIG_COUNT(*) in the SELECT. It will look something like this:

CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
AS

   SELECT
     [year],
     [month],
     COUNT_BIG(*) [MonthCount]
   FROM [dbo].[YourBigTable]
   GROUP BY [year],[month]
GO

CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month 
   ON [dbo].[vwMonthYear](Year,Month)

Now when you SELECT DISTINCT [Year],[Month] on the big table, the query optimizer will scan the tiny index on the view instead of scanning millions of records on the big table.

SELECT DISTINCT
   [year],
   [month]
FROM YourBigTable

This technique took me from 5 million reads with an estimated I/O of 10.9 to 36 reads with an estimated I/O of 0.003. The overhead on this will be that of maintaining an additional index, so each time the large table is updated the index on the view will also be updated.

If you find this index is substantially slowing down your load times. Drop the index, perform your data load and then recreate it.

Full working example:

    CREATE TABLE YourBigTable(
        YourBigTableID INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_YourBigTable_YourBigTableID PRIMARY KEY,
        [Year] INT,
        [Month] INT)
    GO  


    CREATE VIEW dbo.vwMonthYear WITH SCHEMABINDING
    AS

       SELECT
          [year],
          [month],
          COUNT_BIG(*) [MonthCount]
       FROM [dbo].[YourBigTable]
       GROUP BY [year],[month]
    GO

    CREATE UNIQUE CLUSTERED INDEX ICU_vwMonthYear_Year_Month ON [dbo].[vwMonthYear](Year,Month)


    SELECT DISTINCT
       [year],
       [month]
    FROM YourBigTable

-- Actual execution plan shows SQL server scaning ICU_vwMonthYear_Year_Month

Upvotes: 3

Gabriel Guimarães
Gabriel Guimarães

Reputation: 2744

Make sure to have an Clustered Index on those columns. and partition your table on these date columns an place the datafiles on different disk drives I Believe keeping your index fragmentation low is your best shot.

I also Believe having a physical view with the desired select is not a good idea, because it adds Insert/Update overhead. on average there's 3,5 insert's per minute. or about 17 seconds between each insert (on average please correct me if I'm wrong)

The question is are you selecting more often than every 17 seconds? That's the key thought. Hope it helped.

Upvotes: 4

Remus Rusanu
Remus Rusanu

Reputation: 294297

Make the date the first column in the table's clustered index key. This is very typical for historic data, because most, if not all, queries are interested in specific ranges and a clustered index on time can address this. All queries like 'month of May' need to be addressed as ranges, eg: WHERE DATECOLKEY BETWEEN '05/01/2010' AND '06/01/2001'. Answering a question like 'are there any records in May' will involve a simple seek into the clustered index.

While this seems complicated for a programmer mind, it is the optimal way to approach a database design problem.

Upvotes: 1

KM.
KM.

Reputation: 103597

create a materialized indexed view of:

SELECT DISTINCT
    MonthCol, YearCol
    FROM YourTable

you will now get access to the pre-computed distinct values without going through the work every time.

Upvotes: 1

Adriaan Stander
Adriaan Stander

Reputation: 166406

Using a simple index on the columns required (Year and Month) should greatly improve either a DISTINCT, or GROUP BY Query.

I would not go with a secondary table as this adds extra over head to maintaining the secondary table (inserts/updates deletes will require that you validate the secondary table)

EDIT:

You might even want to consider using Improving Performance with SQL Server 2005 Indexed Views

Upvotes: 13

Related Questions