vzwick
vzwick

Reputation: 11044

Performant stable random sort in SQL Server?

I want to query a rather large table (millions of rows), providing a seed value, in a manner that will guarantee a random order - but one that remains stable across multiple queries as long as the same seed is used.

The best I've come up with so far is

SELECT TOP n *
      FROM tbl t
  ORDER BY t.int_column % seed, t.int_column

Is this a usable approach, both from a performance standpoint and a somewhat-uniform distribution of result rows over different seeds?

Edit:

For context, the stable sort is needed because of multiple - possibly nested - WHERE NOT IN queries that operate on the same data set; e.g.

SELECT *
  FROM tbl t
 WHERE t.some_criteria = 'some_value'
   AND t.id NOT IN
(
    SELECT TOP n t.id
          FROM tbl t
         WHERE t.some_other_criteria = 'some_other_value'
      ORDER BY t.int_column % seed, t.int_column
)
   AND t.id NOT IN
(
    # etc.
)

When the order of the subselects is random, but not stable (i.e. NEWID(), TABLESAMPLE()), the result rows fluctuate wildly between executions.

Upvotes: 0

Views: 233

Answers (2)

Jason A. Long
Jason A. Long

Reputation: 4442

Just a thought... You could add a "RamdomSort" column to you table. That way the the sort order will be truly random but will remain repeatable repeatable until you update the table with new values. Something along these lines...

ALTER TABLE dbo.MyTable ADD RandomSort INT NOT NULL 
CONSTRAINT df_MyTable_RandomSort DEFAULT(0);


UPDATE mt SET
    mt.RandomSort = ABS(CHECKSUM(NEWID())) % 100000 + 1
FROM
    dbo.MyTable mt;

SELECT 
    *
FROM
    dbo.MyTable mt
ORDER BY 
    mt.SomeValue;

If the situation warrants it, you could even add a covering, nonclustered index to eliminate the sort operation.

Upvotes: 0

Travis Gockel
Travis Gockel

Reputation: 27673

If you want random-looking ordering, you can do this with HASHBYTES and some piece of data from the row you're selecting.

SELECT TOP 100 *
  FROM tbl t
  ORDER BY HASHBYTES('SHA1', CONCAT(STR(t.int_column), 'seed string'))

Now, the performance of this is a big question. Modern CPUs do SHA1 very quickly, so this might be good enough for your needs.

If you can more about performance and less about "good randomness," you could drop in a simple linear congruential generator as the transformation function:

SET ARITHABORT    OFF;
SET ARITHIGNORE   ON;
SET ANSI_WARNINGS OFF;

SELECT TOP 100 *
  FROM tbl t
  ORDER BY ((t.int_column + seed_number) * 1103515245 + 12345)

This will be faster, but less random.

Upvotes: 1

Related Questions