Alexander Sigachov
Alexander Sigachov

Reputation: 1551

Using SQL concatenation with ORDER BY

I'm confused. How could you explain this diffenece in variable concatenation with ORDER BY?

declare @tbl table (id int);
insert into @tbl values (1), (2), (3);

declare @msg1 varchar(100) = '', @msg2 varchar(100) = '',
    @msg3 varchar(100) = '',    @msg4 varchar(100) = '';

select @msg1 = @msg1 + cast(id as varchar) from @tbl
order by id;

select @msg2 = @msg2 + cast(id as varchar) from @tbl
order by id+id;

select @msg3 = @msg3 + cast(id as varchar) from @tbl
order by id+id desc;

select TOP(100) @msg4 = @msg4 + cast(id as varchar) from @tbl
order by id+id;

select
    @msg1 as msg1,
    @msg2 as msg2,
    @msg3 as msg3,
    @msg4 as msg4;

Results

msg1  msg2  msg3  msg4
----  ----  ----  ----
123   3     1     123  

Upvotes: 5

Views: 4670

Answers (4)

Jason W
Jason W

Reputation: 13209

As many have confirmed, this is not the right way to concatenate all the rows in a column into a variable - even though in some cases it does "work". If you want to see some alternatives, please check out this blog.

According to MSDN (applies to SQL Server 2008 through 2014 and Azure SQL Database) , the SELECT should not be used to assign local variables. In the remarks, it describes how, when you do use the SELECT, it attempts to behave. The interesting points to note:

  • While typically it should only be used to return a single value to a variable, when the expression is the name of the column, it can return multiple values.
  • When the expression does return multiple values, the variable is assigned the last value that is returned.
  • If no value is returned, the variable retains its original value (not directly relevant here, but worth noting).

The first two points here are key - concatenation happens to work because SELECT @msg1 = @msg1 + cast(id as varchar) is essentially SELECT @msg1 += cast(id as varchar), and as the syntax notes, += is an accepted compound assignment operator on this expression. Please note here that it should not be expected this operation to continue to be supported on VARCHAR and to do string concatenation - just because it happens to work in some situations doesn't mean it is ok for production code.

The bottom line as to the underlying reason is whether the Compute Scalar that runs on the select expression uses the original id column or an expression of the id column. You probably can't find any docs on why the optimizer might choose the specific plans for each query, but each example highlights different use cases that allow the msg value to be evaluated from the column (and therefore multiple rows being returned and concatenated) or expression (and therefore only the last column).

  1. @msg1 is '123' because the Compute Scalar (the row-by-row evaluation of the variable assignment) occurs after the Sort. This allows the scalar computation to return multiple values on the id column concatenating them through the += compound operator. I doubt there is specific documentation why, but it appears the optimizer chose to do the sort before the scalar computation because the order by was a column and not an expression.

  2. @msg2 is '3' because the Compute Scalar is done before the sort, which leaves the @msg2 in each row just being the ('' + id) - so never concatenated, just the value of the id. Again, probably not any documentation why the optimizer chose this, but it appears that since the order by was an expression, perhaps it needed to do the (id+id) in the order by as part of the scalar computation before it could sort. At this point, your original column is no longer referencing the source column, but it has been replaced by an expression. Therefore, as MSDN stated, your first column points to an expression, not a column, so the behavior assigns the last value of the result set to the variable in the SELECT. Since you sorted ASC, you get '3' here.

  3. @msg3 is '1' for the same reason as example 2, except you ordered DESC. Again, this becomes an expression in the evaluation - not the original column, so therefore the assignment gets the last value of the DESC order, so you get '1'.

  4. @msg4 is '123' again because the TOP operation forces an initial scalar evaluation of the ORDER BY so that it can determine your top 100 records. This is different than examples 2 and 3 in which the scalar computation contained both the order by and select computations which caused each example to be an expression and not refer back to the original column. Example 4 has the TOP separating the ORDER BY and SELECT computations, so after the SORT (TOP N SORT) is applied, it then does the scalar computation for the SELECT columns in which at this point you are still referencing the original column (not an expression of the column), and therefore it returns multiple rows allowing the concatenation to occur.

Sources:

Upvotes: 6

Stephan
Stephan

Reputation: 6018

All I can see is there is a difference in the execution plans. They all start with SELECT and end with Table Scan. The difference is in between, the Compute Scalar and the Sort.

@Msg1 has Compute Scalar then Sort. Results: 123
@Msg2 has Sort then Compute Scalar. Results: 3
@Msg3 has Sort then Compute Scalar. Results: 1

The fourth one is different because of the top. It still starts with select and ends with table scan, but it's different in the middle. It uses a different sort.

@Msg4 has Compute Scalar then Sort(Top N Sort) then Compute Scalar

Upvotes: 1

jtimperley
jtimperley

Reputation: 2544

SQL Server will calculate the results, then sort them, then return them. In the case of assigning a variable, only the first result will be used to populate your variable. You are receiving the first value from the sorted result sets, which can move around the order SQL Server will scan the records as well as the position within the results.

TOP will always produce special query plans as it immediately forces SQL Server to stick to the natural ordering of the results instead of producing query plans that would statistically reduce the number of records it must read.

To explain the differences, you'll have to refer to how SQL Server decided to implicitly sort the values to optimize the query.

Query 1

Insert -> Table Insert -> Constant Scan

Query 2

SELECT -> Compute Scalar -> Sort -> Table Scan

Query 3, and 4

SELECT -> Sort -> Compute Scalar -> Table Scan

Query 5 and 6 (using TOP)

SELECT -> Compute Scalar -> Sort (Top N) -> Compute Scalar -> Table Scan

I added Query 6:

select top (100)
    @msg5 = @msg5 + cast(id as varchar)
from @tbl
order by id+id desc

Upvotes: 2

Luaan
Luaan

Reputation: 63772

You're not supposed to set variables in a select that returns more than a single row. Consider this code:

select top 1 @msg1 = @msg1 + cast(id as varchar) from @tbl
order by id;

select top 1 @msg2 = @msg2 + cast(id as varchar) from @tbl
order by id+id;

select top 1 @msg3 = @msg3 + cast(id as varchar) from @tbl
order by id+id desc;

select top 1 @msg4 = @msg4 + cast(id as varchar) from @tbl
order by id+id;

Producing 1, 1, 3 and 1, respectively.

I'm pretty surprised it doesn't cause an exception, I was quite sure it used to forbid this outright.

The underlying point is still the same: the SQL engine isn't just executing some commands procedurally, one by one, as you might expect. It will build an execution plan that is tailored to be as efficient as possible (given many constraints).

On the other hand, assigning a variable is inherently procedural, and requires an explicit execution / evaluation order to work correctly.

You're combining the two approches - select id from @tbl order by id is a non-procedural query, but select @id = id from @tbl order by id is a mix of the procedural @id = id, and the very much non-procedural select.

Upvotes: 0

Related Questions