mark
mark

Reputation: 57

STRING_SPLIT with string position

I have a table with two columns of comma-separated strings. The way the data is formatted, the number of comma-separated items in both columns is equal, and the first value in colA is related to the first value in colB, and so on. (It's obviously not a very good data format, but it's what I'm working with.)

If I have the following row (PrimaryKeyID | column1 | column2):

1 | a,b,c | A,B,C

then in this data format, a & 1 are logically related, b & 2, etc.

I want to use STRING_SPLIT to split these columns, but using it twice obviously crosses them with each other, resulting in a total of 9 rows.

1 | a | A
1 | b | A    
1 | c | A    
1 | a | B    
1 | b | B    
1 | c | B    
1 | a | C
1 | b | C    
1 | c | C

What I want is just the 3 "logically-related" columns

1 | a | A
1 | b | B
1 | c | C

However, STRING_SPLIT(myCol,',') doesn't appear to save the String Position anywhere.

I have done the following:

SELECT tbl.ID,
      t1.Column1Value,
      t2.Column2Value
FROM myTable tbl
INNER JOIN (
   SELECT t.ID, 
       ss.value AS Column1Value, 
       ROW_NUMBER() OVER (PARTITION BY t.ID ORDER BY t.ID) as StringOrder
   FROM myTable t
   CROSS APPLY STRING_SPLIT(t.column1,',') ss
) t1 ON tbl.ID = t1.ID
INNER JOIN (
   SELECT t.ID, 
       ss.value AS Column2Value, 
       ROW_NUMBER() OVER (PARTITION BY t.ID ORDER BY t.ID) as StringOrder
   FROM myTable t
   CROSS APPLY STRING_SPLIT(t.column2,',') ss
) t1 ON tbl.ID = t2.ID AND t1.StringOrder = t2.StringOrder

This appears to work on my small test set, but in my opinion there is no reason to expect it to work guaranteed every time. The ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) is obviously a meaningless ordering, but it appears that, in absence of any real ordering, STRING_SPLIT is returning the values in the "default" order that they were already in. Is this "expected" behaviour? Can I count on this? Is there any other way of accomplishing what I'm attempting to do?

I got what I wanted (I think) with the following UDF. However it's pretty slow. Any suggestions?

CREATE FUNCTION fn.f_StringSplit(@string VARCHAR(MAX),@delimiter VARCHAR(1))
RETURNS @r TABLE
(
    Position INT,
    String VARCHAR(255)
)
AS
BEGIN

    DECLARE @current_position INT
    SET @current_position = 1

    WHILE CHARINDEX(@delimiter,@string) > 0 BEGIN
    
        INSERT INTO @r (Position,String) VALUES (@current_position, SUBSTRING(@string,1,CHARINDEX(@delimiter,@string) - 1))

        SET @current_position = @current_position + 1
        SET @string = SUBSTRING(@string,CHARINDEX(@delimiter,@string) + 1, LEN(@string) - CHARINDEX(@delimiter,@string))
        
    END

    --add the last one
    INSERT INTO @r (Position, String) VALUES(@current_position,@string)

    RETURN
END

Upvotes: 4

Views: 24281

Answers (8)

Lawrence McKellar
Lawrence McKellar

Reputation: 1

Here is a t-sql function that uses string_split and adds the ordinal column:

drop function if exists [dbo].[varchar_split2];
go
create function [dbo].[varchar_split2]
(
    @text varchar(max),
    @delimiter char(1) = ','
)
returns @result table ([Ordinal] int not null identity(1, 1) primary key, [Value] varchar(128) not null)
as
begin
    insert @result ([Value])
    select
        [Value]
    from
        string_split(@text, @delimiter)
    where
        0 != len([Value])
    ;
    return;
end;
go

Upvotes: 0

N8allan
N8allan

Reputation: 2268

Fortunately in newer SQL Server (Azure and 2022) an optional flag has been added to String_Split to include an "ordinal" column. If you are using a newer version of SQL Server, this finally provides a solution that is logically correct rather than implementation specific.

New definition:

String_Split(string, separator [, enable_ordinal])
e.g. String_Split('1.2.3', '.', 1)

Example:

with V as (select Value v, Ordinal n from String_Split('1.2.3', '.', 1))
    select
        (select v from V where n = 1) Major,
        (select v from V where n = 2) Minor,
        (select v from V where n = 3) Revision

Returns:

Major Minor Revision
----- ----- ---------
1     2     3        

Upvotes: 6

N8allan
N8allan

Reputation: 2268

The only way I've discovered to expressively maintain the order of the String_Split() function this is using the Row_Number() function with a literal value in the "order by".

For example:

declare @Version nvarchar(128)
set @Version = '1.2.3';

with V as (select value v, Row_Number() over (order by (select 0)) n from String_Split(@Version, '.'))
    select
        (select v from V where n = 1) Major,
        (select v from V where n = 2) Minor,
        (select v from V where n = 3) Revision

Returns:

Major Minor Revision
----- ----- ---------
1     2     3        

Update: if you are using a newer version of SQL Server, you can now provide an optional third bit argument which indicates that and ordinal column should also be included in the result. See my other answer here for more details.

Upvotes: 19

Amir Reza
Amir Reza

Reputation: 473

This is very simple

CREATE TABLE #a(
id [INT] IDENTITY(1,1) NOT NULL,
OrgId INT )

INSERT INTO #a
(
    OrgId
)
 SELECT  value FROM STRING_SPLIT('18,44,45,46,47,48,49,50,51,52,53', ',')

Select * from #a

Upvotes: 0

Sergey
Sergey

Reputation: 11

Mark, here is a solution I would use. Assuming that [column 1] in your table has the "key" values that are more less stable, and [column2] has corresponding "field" values that can be sometimes omitted or NULL:

  • There will be two extractions, one for [column 1] - which I assume is the Key, another for [column 2] - which I assume is the sort of "values" for the "key", they will be auto parsed then by STRING_SPLIT function.

  • These two INDEPENDENT result-sets will be then re-numbered based on the time of operation (which is always sequential). Take note, we renumber not by the field content or position of the comma etc, BUT by the timestamp.

  • Then they will get joined back together by LEFT OUTER JOIN; note not by INNER JOIN due to the fact that our "field values" could get omitted, while "keys" will always be there

Below is the TSQL code, as this is my first post to this site, hope it looks ok:

SELECT T1.ID, T1.KeyValue, T2.FieldValue
from (select t1.ID, row_number() OVER (PARTITION BY t1.ID ORDER BY current_timestamp) AS KeyRow, t2.value AS KeyValue 
from myTable t1
CROSS APPLY STRING_SPLIT(t1.column1,',')  as t2) T1
LEFT OUTER JOIN
(select t1.ID, row_number() OVER (PARTITION BY t1.ID ORDER BY current_timestamp) AS FieldRow, t3.value AS FieldValue 
from myTable t1
CROSS APPLY STRING_SPLIT(t1.column2,',')  as t3) T2 ON T1.ID = T2.ID AND T1.KeyRow = T2.FieldRow

Upvotes: 1

Mahmoud Nasr
Mahmoud Nasr

Reputation: 642

SELECT
PrimaryKeyID ,t2.items as column1, t1.items as column2 from [YourTableName]
cross Apply [dbo].[Split](column2) as t1
cross Apply [dbo].[Split](column1) as t2

Upvotes: 1

Solonotix
Solonotix

Reputation: 471

I'm a little late to this question, but I was just attempting the same thing with string_split since I've run into a performance problem of late. My experience with string splitters in T-SQL has led me to use recursive CTE's for most things containing fewer than 1,000 delimited values. Ideally, a CLR procedure would be used if you need ordinal in your string split.

That said, I've come to a similar conclusion as you on getting ordinal from string_split. You can see the queries and statistics below which, in order, are the bare string_split function, a CTE RowNumber of string_split, and then my personal string split CTE function I derived from this awesome write-up. The main difference between my CTE-based function and the one in the write-up is I made it an Inline-TVF instead of their implementation of a MultiStatement-TVF, which you can read about the differences here.

In my experiments I haven't seen a deviation using ROW_NUMBER on a constant returning the internal order of the delimited string, so I will be using it until such time as I find a problem with it, but if order is imperative in a business setting, I would probably recommend the Moden splitter featured in the first link above, which links to the author's article here since it is right in-line with the performance seen by the less safe string_split with RowNumber approach.

set nocount on;

declare
    @iter int = 0,
    @rowcount int,
    @val varchar(max) = '';

while len(@val) < 1e6
    select
        @val += replicate(concat(@iter, ','), 8e3),
        @iter += 1;

raiserror('Begin string_split Built-In', 0, 0) with nowait;

set statistics time, io on;

select
    *
from
    string_split(@val, ',')
where
    [value] > '';

select
    @rowcount = @@rowcount;

set statistics time, io off;

print '';
raiserror('End string_split Built-In | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin string_split Built-In with RowNumber', 0, 0) with nowait;

set statistics time, io on;

with cte
as  (
    select
        *,
        [group] = 1
    from
        string_split(@val, ',')
    where
        [value] > ''
    ),
    cteCount
as  (
    select
        *,
        [id] = row_number() over (order by [group])
    from
        cte
    )
select
    *
from
    cteCount;

select
    @rowcount = @@rowcount;

set statistics time, io off;

print '';
raiserror('End string_split Built-In with RowNumber | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin Moden String Splitter', 0, 0) with nowait;

set statistics time, io on;

select
    *
from
    dbo.SplitStrings_Moden(@val, ',')
where
    item > '';

select
    @rowcount = @@rowcount;

set statistics time, io off;

print '';
raiserror('End Moden String Splitter | Return %d Rows', 0, 0, @rowcount) with nowait;
print '';
raiserror('Begin Recursive CTE String Splitter', 0, 0) with nowait;

set statistics time, io on;

select
    *
from
    dbo.fn_splitByDelim(@val, ',')
where
    strValue > ''
option
    (maxrecursion 0);

select
    @rowcount = @@rowcount;

set statistics time, io off;

Statistics being

Begin string_split Built-In

 SQL Server Execution Times:
   CPU time = 2000 ms,  elapsed time = 5325 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End string_split Built-In | Return 331940 Rows

Begin string_split Built-In with RowNumber

 SQL Server Execution Times:
   CPU time = 2094 ms,  elapsed time = 8119 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End string_split Built-In with RowNumber | Return 331940 Rows

Begin Moden String Splitter
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 6 ms.

 SQL Server Execution Times:
   CPU time = 8734 ms,  elapsed time = 9009 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End Moden String Splitter | Return 331940 Rows

Begin Recursive CTE String Splitter
Table 'Worktable'. Scan count 2, logical reads 1991648, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 147188 ms,  elapsed time = 147480 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

End Recursive CTE String Splitter | Return 331940 Rows

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270391

Your idea is fine, but your order by is not using a stable sort. I think it is safer to do:

SELECT tbl.ID, t1.Column1Value, t2.Column2Value
FROM myTable tbl INNER JOIN
     (SELECT t.ID, ss.value AS Column1Value, 
             ROW_NUMBER() OVER (PARTITION BY t.ID
                                ORDER BY CHARINDEX(',' + ss.value + ',', ',' + t.column1 + ',')
                               ) as StringOrder
      FROM myTable t CROSS APPLY
           STRING_SPLIT(t.column1,',') ss
     ) t1
     ON tbl.ID = t1.ID INNER JOIN
     (SELECT t.ID, ss.value AS Column2Value, 
             ROW_NUMBER() OVER (PARTITION BY t.ID
                                ORDER BY CHARINDEX(',' + ss.value + ',', ',' + t.column2 + ',')
                               ) as StringOrder
      FROM myTable t CROSS APPLY
           STRING_SPLIT(t.column2, ',') ss
     ) t2
     ON tbl.ID = t2.ID AND t1.StringOrder = t2.StringOrder;

Note: This may not work as desired if the strings have non-adjacent duplicates.

Upvotes: 2

Related Questions