Reputation: 739
I want to select a random value of table for each row of a another table.
I have the following code:
SELECT T1.COL1,(SELECT TOP 1 t2.COL2 FROM T2 ORDER BY NEWID())FROM T1
I understand the reason why it does not work. It selects a random value from T2, but it is the same for each row selected from T1.
Thank you.
Upvotes: 2
Views: 929
Reputation: 32703
The query suggested by Gordon is correct and it produces expected result, but it is quite inefficient if your tables have more than a handful of rows.
I'll repeat it here:
SELECT T1.COL1,
(SELECT TOP 1 t2.COL2 FROM T2 where t1.col1 is not null ORDER BY NEWID() )
FROM T1
Essentially, it runs the subquery for each row of the T1
table. The subquery reads the whole T2
table, sorts the whole table, picks one row and discards the rest. The T2
table is scanned as many times as there are rows in T1
.
The query can return the same value t2.COL2
for different rows of T1
.
If this is the requirement and part of the definition of "random selection", then you can't do much about it.
But, if there is no such a requirement and it is allowed to map two tables row-to-row without repetitions, then there is much faster way to do it by scanning the source tables only once.
In my test SQL Server 2008 I have a table Numbers
with 100,000 rows with numbers from 1 to 100,000 and a table Calendar
with dates from 2000-01-01 to 2037-12-31 (13880 rows).
So, I wrote two queries:
Row number
WITH
CTE1
AS
(
SELECT
T1.dt
,ROW_NUMBER() OVER (ORDER BY dt) AS rn1
FROM dbo.Calendar AS T1
)
,CTE2
AS
(
SELECT
T2.Number
,ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn2
FROM dbo.Numbers AS T2
)
SELECT
CTE1.dt
,CTE2.Number
FROM
CTE1
INNER JOIN CTE2 ON CTE1.rn1 = CTE2.rn2
;
Subquery
SELECT T1.dt,
(SELECT TOP 1 t2.Number FROM dbo.Numbers AS T2 where t1.dt is not null ORDER BY NEWID() )
FROM dbo.Calendar AS T1
;
I ran them in SQL Sentry Plan Explorer and compared their execution plans and performance:
As you can see from this screenshot, the RowNumber query finished in 233 ms and Subquery variant took 289,950 ms. About 1200 times more.
It becomes apparent what is going on when you look at the execution plans:
RowNumber
You can see that both tables are scanned once and merge joined together.
SubQuery
Here the Numbers
table is scanned and sorted 13880 times.
I kept the RowNumber variant simple to illustrate the concept. It will not work as intended if T1
has more rows than T2
.
It is easy to fix it, though by cross joining T2
to itself as many times as needed to generate enough rows before calculating row numbers.
Upvotes: 3
Reputation: 1271151
This is a tricky part of SQL Server. It is a bit too aggressive in optimizing queries. You can prevent this with a correlation clause:
SELECT T1.COL1,
(SELECT TOP 1 t2.COL2 FROM T2 where t1.col1 is not null ORDER BY NEWID() )
FROM T1
The added clause where t1.col1 is not null
will force SQL Server to compute the subquery for each row. Without this, as in your original query, the subquery is being computed once, and then getting cached.
Upvotes: 5