Reputation: 1551
I'm confused. How could you explain this diffenece in variable concatenation with ORDER BY?
declare @tbl table (id int);
insert into @tbl values (1), (2), (3);
declare @msg1 varchar(100) = '', @msg2 varchar(100) = '',
@msg3 varchar(100) = '', @msg4 varchar(100) = '';
select @msg1 = @msg1 + cast(id as varchar) from @tbl
order by id;
select @msg2 = @msg2 + cast(id as varchar) from @tbl
order by id+id;
select @msg3 = @msg3 + cast(id as varchar) from @tbl
order by id+id desc;
select TOP(100) @msg4 = @msg4 + cast(id as varchar) from @tbl
order by id+id;
select
@msg1 as msg1,
@msg2 as msg2,
@msg3 as msg3,
@msg4 as msg4;
Results
msg1 msg2 msg3 msg4
---- ---- ---- ----
123 3 1 123
Upvotes: 5
Views: 4670
Reputation: 13209
As many have confirmed, this is not the right way to concatenate all the rows in a column into a variable - even though in some cases it does "work". If you want to see some alternatives, please check out this blog.
According to MSDN (applies to SQL Server 2008 through 2014 and Azure SQL Database) , the SELECT
should not be used to assign local variables. In the remarks, it describes how, when you do use the SELECT
, it attempts to behave. The interesting points to note:
The first two points here are key - concatenation happens to work because SELECT @msg1 = @msg1 + cast(id as varchar)
is essentially SELECT @msg1 += cast(id as varchar)
, and as the syntax notes, +=
is an accepted compound assignment operator on this expression. Please note here that it should not be expected this operation to continue to be supported on VARCHAR
and to do string concatenation - just because it happens to work in some situations doesn't mean it is ok for production code.
The bottom line as to the underlying reason is whether the Compute Scalar
that runs on the select expression uses the original id column or an expression of the id column. You probably can't find any docs on why the optimizer might choose the specific plans for each query, but each example highlights different use cases that allow the msg value to be evaluated from the column (and therefore multiple rows being returned and concatenated) or expression (and therefore only the last column).
@msg1 is '123' because the Compute Scalar
(the row-by-row evaluation of the variable assignment) occurs after the Sort
. This allows the scalar computation to return multiple values on the id column concatenating them through the +=
compound operator. I doubt there is specific documentation why, but it appears the optimizer chose to do the sort before the scalar computation because the order by was a column and not an expression.
@msg2 is '3' because the Compute Scalar
is done before the sort, which leaves the @msg2 in each row just being the ('' + id) - so never concatenated, just the value of the id. Again, probably not any documentation why the optimizer chose this, but it appears that since the order by was an expression, perhaps it needed to do the (id+id) in the order by as part of the scalar computation before it could sort. At this point, your original column is no longer referencing the source column, but it has been replaced by an expression. Therefore, as MSDN stated, your first column points to an expression, not a column, so the behavior assigns the last value of the result set to the variable in the SELECT. Since you sorted ASC, you get '3' here.
@msg3 is '1' for the same reason as example 2, except you ordered DESC. Again, this becomes an expression in the evaluation - not the original column, so therefore the assignment gets the last value of the DESC order, so you get '1'.
@msg4 is '123' again because the TOP
operation forces an initial scalar evaluation of the ORDER BY
so that it can determine your top 100 records. This is different than examples 2 and 3 in which the scalar computation contained both the order by and select computations which caused each example to be an expression and not refer back to the original column. Example 4 has the TOP separating the ORDER BY and SELECT computations, so after the SORT (TOP N SORT) is applied, it then does the scalar computation for the SELECT columns in which at this point you are still referencing the original column (not an expression of the column), and therefore it returns multiple rows allowing the concatenation to occur.
Sources:
Upvotes: 6
Reputation: 6018
All I can see is there is a difference in the execution plans. They all start with SELECT and end with Table Scan. The difference is in between, the Compute Scalar and the Sort.
@Msg1 has Compute Scalar then Sort. Results: 123
@Msg2 has Sort then Compute Scalar. Results: 3
@Msg3 has Sort then Compute Scalar. Results: 1
The fourth one is different because of the top. It still starts with select and ends with table scan, but it's different in the middle. It uses a different sort.
@Msg4 has Compute Scalar then Sort(Top N Sort) then Compute Scalar
Upvotes: 1
Reputation: 2544
SQL Server will calculate the results, then sort them, then return them. In the case of assigning a variable, only the first result will be used to populate your variable. You are receiving the first value from the sorted result sets, which can move around the order SQL Server will scan the records as well as the position within the results.
TOP will always produce special query plans as it immediately forces SQL Server to stick to the natural ordering of the results instead of producing query plans that would statistically reduce the number of records it must read.
To explain the differences, you'll have to refer to how SQL Server decided to implicitly sort the values to optimize the query.
Query 1
Insert -> Table Insert -> Constant Scan
Query 2
SELECT -> Compute Scalar -> Sort -> Table Scan
Query 3, and 4
SELECT -> Sort -> Compute Scalar -> Table Scan
Query 5 and 6 (using TOP)
SELECT -> Compute Scalar -> Sort (Top N) -> Compute Scalar -> Table Scan
I added Query 6:
select top (100)
@msg5 = @msg5 + cast(id as varchar)
from @tbl
order by id+id desc
Upvotes: 2
Reputation: 63772
You're not supposed to set variables in a select that returns more than a single row. Consider this code:
select top 1 @msg1 = @msg1 + cast(id as varchar) from @tbl
order by id;
select top 1 @msg2 = @msg2 + cast(id as varchar) from @tbl
order by id+id;
select top 1 @msg3 = @msg3 + cast(id as varchar) from @tbl
order by id+id desc;
select top 1 @msg4 = @msg4 + cast(id as varchar) from @tbl
order by id+id;
Producing 1, 1, 3 and 1, respectively.
I'm pretty surprised it doesn't cause an exception, I was quite sure it used to forbid this outright.
The underlying point is still the same: the SQL engine isn't just executing some commands procedurally, one by one, as you might expect. It will build an execution plan that is tailored to be as efficient as possible (given many constraints).
On the other hand, assigning a variable is inherently procedural, and requires an explicit execution / evaluation order to work correctly.
You're combining the two approches - select id from @tbl order by id
is a non-procedural query, but select @id = id from @tbl order by id
is a mix of the procedural @id = id
, and the very much non-procedural select.
Upvotes: 0