Pedro F
Pedro F

Reputation: 28

Split string into multiple rows and columns

I've a dataset similar to the one below (there are more columns but just using these for example purposes).

PersonId LocationId StartDate AttendanceString
123 987 2018-09-01 XXXXZZZZ######PPLL
234 678 2018-10-01 PPPPLL######ZZZZXX

What I need to achieve is split the AttendanceString column into multiple rows and multiple columns. The attendance string needs to be broken down every 2 characters and split into 2 different columns, representing a morning and afternoon period. An example will make this clearer so let's use the first record. The desired outcome is:

PersonId LocationId StartDate MorningAttendanceString AfternoonAttendanceString
123 987 2018-09-01 X X
123 987 2018-09-02 X X
123 987 2018-09-03 Z Z
123 987 2018-09-04 Z Z
123 987 2018-09-05 # #

For each string, we need to iterate until we reach the last character, adding new records into a table corresponding to a different day and with separate records for morning/afternoon.

I'm able to achieve the desired logic with the code at the end of this post. However, because this may involve about 90/100k records and most of them will need to be broken into 365 records, we're talking about 33-35M records that need to be created. I'm using a cursor with a while cycle to get these results and even though cursors are normally to be avoided, I don't really think it's the issue here. This normally takes about 30mins to run on an Azure SQL Database in a S6 Standard Tier.

Is there any option I'm missing to make this more efficient? Ideally I'd like to reduce the time it takes to process the data. I can't really use split_string since it expects a specific character used to break the string.

DECLARE @LocationID as int;
DECLARE @AttendanceString as varchar(1000);
DECLARE @StartDate as date;
DECLARE @PersonId as int;

DECLARE @MorningValue as char(1);
DECLARE @AfternoonValue as char(1);

DECLARE @DayDate as date;
DECLARE @AttendanceCursor as cursor;

DECLARE @i as int;

SET @AttendanceCursor = CURSOR LOCAL FAST_FORWARD FOR
SELECT 
    PersonId,
    LocationId, 
    StartDate, 
    AttendanceString
FROM 
    SourceTable
WHERE 
    StartDate >= '2019-08-01'

BEGIN
    
    SET NOCOUNT ON
    OPEN @AttendanceCursor 
    FETCH NEXT FROM @AttendanceCursor INTO @PersonId , @AttendanceString, @StartDate, @LocationId;
    WHILE @@FETCH_STATUS = 0
    BEGIN

        SET @i = 1;
        SET @DayDate = @StartDate;

        WHILE (@i < len(@AttendanceString))
        BEGIN
            SET @MorningValue = SUBSTRING(@AttendanceString,@i,1);
            SET @AfternoonValue = SUBSTRING(@AttendanceString,@i+1,1);
            
            BEGIN TRY
                INSERT INTO FinalTable
                SELECT @DayDate , @LocationId, @PersonId, @MorningValue @AfternoonValue
            END TRY
            BEGIN CATCH
               ...
            END CATCH
            
            SET @i = @i+2;
            SET @DayDate = DATEADD(DD,1,@DayDate );
        END

        FETCH NEXT FROM @AttendanceCursor INTO @PersonId , @AttendanceString, @StartDate, @LocationId;

    END  

    CLOSE @AttendanceCursor ;
    DEALLOCATE @AttendanceCursor ;

Upvotes: 1

Views: 1045

Answers (1)

wBob
wBob

Reputation: 14379

I rewrote this as a set based query instead of a cursor and this whole script, including the generation of 100k test records runs in about 40 seconds on my Azure SQL DB, which is a Serverless gen 5 with only 1 vCore. Work through the script to make sure you understand it.

NB I'm dropping the tables as this is a test rig - not production code:

------------------------------------------------------------------------------------------------
-- Setup START
------------------------------------------------------------------------------------------------

DROP TABLE IF EXISTS dbo.sourceTable
DROP TABLE IF EXISTS dbo.finalTable
GO

CREATE TABLE dbo.sourceTable
(
    PersonId            INT IDENTITY PRIMARY KEY,
    LocationId          INT,
    StartDate           DATE,
    AttendanceString    VARCHAR(1000)
)
GO

CREATE TABLE dbo.finalTable
(
    DayDate         DATE, 
    LocationId      INT, 
    PersonId        INT, 
    MorningValue    CHAR(1), 
    AfternoonValue  CHAR(1)
)
GO

-- Generate some test data
SET IDENTITY_INSERT dbo.sourceTable ON

INSERT INTO dbo.sourceTable ( PersonId, LocationId, StartDate, AttendanceString )
VALUES
    ( 123, 987, '2018-09-01', 'XXXXZZZZ######PPLL' ),
    ( 234, 678, '2018-10-01', 'PPPPLL######ZZZZXX' ),
    ( 567, 999, '2018-10-01', 'abcdefghijklmnopqr' )

SET IDENTITY_INSERT dbo.sourceTable OFF
GO

-- Setup END
------------------------------------------------------------------------------------------------



------------------------------------------------------------------------------------------------
-- Test Data START
------------------------------------------------------------------------------------------------

;WITH cte AS (
SELECT 1 rn, 1 locationId, CAST( '1 Jan 2018' AS DATE ) startDate, REPLACE( NEWID(), '-', '' ) AttendanceString
UNION ALL
SELECT rn + 1, rn % 42, DATEADD( day, 1, startDate ), REPLACE( NEWID(), '-', '' )
FROM cte
WHERE rn < 100
)
INSERT INTO dbo.sourceTable ( LocationId, StartDate, AttendanceString )
SELECT LocationId, StartDate, AttendanceString 
FROM cte
ORDER BY 1;
GO 1000


-- Test Data END
------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------
-- Rewritten query START
------------------------------------------------------------------------------------------------





DROP TABLE IF EXISTS #tmp;

;WITH cte AS (
SELECT 1 n, 1 x, 2 y
UNION ALL
SELECT n + 1, x + 2, y + 2
FROM cte
WHERE n < 20
)
SELECT
    personId,
    locationId,
    DATEADD( day, c.n - 1, startDate ) xdate,
    SUBSTRING ( attendanceString, c.x, 1 ) a,
    SUBSTRING ( attendanceString, c.y, 1 ) b

INTO #tmp

FROM dbo.sourceTable s
    CROSS APPLY cte c
WHERE c.y <= LEN(attendanceString);


select *
from sourceTable
WHERE personId = 999

select *
from #tmp
WHERE personId = 999

select *
from #tmp
WHERE locationId = 999

-- Rewritten query END
------------------------------------------------------------------------------------------------

Altered version of the script for longer attendanceIds here.

Upvotes: 3

Related Questions