Reputation: 739

How to assign a random value in a select statement in SQL?

I want to select a random value of table for each row of a another table.

I have the following code:

SELECT T1.COL1,(SELECT TOP 1 t2.COL2 FROM T2 ORDER BY NEWID())FROM T1

I understand the reason why it does not work. It selects a random value from T2, but it is the same for each row selected from T1.

Thank you.

Upvotes: 2

Answers (2)

Vladimir Baranov

Reputation: 32703

The query suggested by Gordon is correct and it produces expected result, but it is quite inefficient if your tables have more than a handful of rows.

I'll repeat it here:

SELECT T1.COL1,
       (SELECT TOP 1 t2.COL2 FROM T2 where t1.col1 is not null ORDER BY NEWID() )
FROM T1

Essentially, it runs the subquery for each row of the T1 table. The subquery reads the whole T2 table, sorts the whole table, picks one row and discards the rest. The T2 table is scanned as many times as there are rows in T1.

The query can return the same value t2.COL2 for different rows of T1.

If this is the requirement and part of the definition of "random selection", then you can't do much about it.

But, if there is no such a requirement and it is allowed to map two tables row-to-row without repetitions, then there is much faster way to do it by scanning the source tables only once.

In my test SQL Server 2008 I have a table Numbers with 100,000 rows with numbers from 1 to 100,000 and a table Calendar with dates from 2000-01-01 to 2037-12-31 (13880 rows).

So, I wrote two queries:

Row number

WITH
CTE1
AS
(
    SELECT
        T1.dt
        ,ROW_NUMBER() OVER (ORDER BY dt) AS rn1
    FROM dbo.Calendar AS T1
)
,CTE2
AS
(
    SELECT
        T2.Number
        ,ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn2
    FROM dbo.Numbers AS T2
)
SELECT
    CTE1.dt
    ,CTE2.Number
FROM
    CTE1
    INNER JOIN CTE2 ON CTE1.rn1 = CTE2.rn2
;

Subquery

SELECT T1.dt,
    (SELECT TOP 1 t2.Number FROM dbo.Numbers AS T2 where t1.dt is not null ORDER BY NEWID() )
FROM dbo.Calendar AS T1
;

I ran them in SQL Sentry Plan Explorer and compared their execution plans and performance:

As you can see from this screenshot, the RowNumber query finished in 233 ms and Subquery variant took 289,950 ms. About 1200 times more.

It becomes apparent what is going on when you look at the execution plans:

RowNumber

You can see that both tables are scanned once and merge joined together.

SubQuery

Here the Numbers table is scanned and sorted 13880 times.

I kept the RowNumber variant simple to illustrate the concept. It will not work as intended if T1 has more rows than T2.

It is easy to fix it, though by cross joining T2 to itself as many times as needed to generate enough rows before calculating row numbers.

Upvotes: 3

Gordon Linoff

Reputation: 1271151

This is a tricky part of SQL Server. It is a bit too aggressive in optimizing queries. You can prevent this with a correlation clause:

SELECT T1.COL1,
       (SELECT TOP 1 t2.COL2 FROM T2 where t1.col1 is not null ORDER BY NEWID() )
FROM T1

The added clause where t1.col1 is not null will force SQL Server to compute the subquery for each row. Without this, as in your original query, the subquery is being computed once, and then getting cached.

Upvotes: 5

How to assign a random value in a select statement in SQL?

Answers (2)

Related Questions