Reputation: 28
I've a dataset similar to the one below (there are more columns but just using these for example purposes).
PersonId | LocationId | StartDate | AttendanceString |
---|---|---|---|
123 | 987 | 2018-09-01 | XXXXZZZZ######PPLL |
234 | 678 | 2018-10-01 | PPPPLL######ZZZZXX |
What I need to achieve is split the AttendanceString column into multiple rows and multiple columns. The attendance string needs to be broken down every 2 characters and split into 2 different columns, representing a morning and afternoon period. An example will make this clearer so let's use the first record. The desired outcome is:
PersonId | LocationId | StartDate | MorningAttendanceString | AfternoonAttendanceString |
---|---|---|---|---|
123 | 987 | 2018-09-01 | X | X |
123 | 987 | 2018-09-02 | X | X |
123 | 987 | 2018-09-03 | Z | Z |
123 | 987 | 2018-09-04 | Z | Z |
123 | 987 | 2018-09-05 | # | # |
For each string, we need to iterate until we reach the last character, adding new records into a table corresponding to a different day and with separate records for morning/afternoon.
I'm able to achieve the desired logic with the code at the end of this post. However, because this may involve about 90/100k records and most of them will need to be broken into 365 records, we're talking about 33-35M records that need to be created. I'm using a cursor with a while cycle to get these results and even though cursors are normally to be avoided, I don't really think it's the issue here. This normally takes about 30mins to run on an Azure SQL Database in a S6 Standard Tier.
Is there any option I'm missing to make this more efficient? Ideally I'd like to reduce the time it takes to process the data. I can't really use split_string since it expects a specific character used to break the string.
DECLARE @LocationID as int;
DECLARE @AttendanceString as varchar(1000);
DECLARE @StartDate as date;
DECLARE @PersonId as int;
DECLARE @MorningValue as char(1);
DECLARE @AfternoonValue as char(1);
DECLARE @DayDate as date;
DECLARE @AttendanceCursor as cursor;
DECLARE @i as int;
SET @AttendanceCursor = CURSOR LOCAL FAST_FORWARD FOR
SELECT
PersonId,
LocationId,
StartDate,
AttendanceString
FROM
SourceTable
WHERE
StartDate >= '2019-08-01'
BEGIN
SET NOCOUNT ON
OPEN @AttendanceCursor
FETCH NEXT FROM @AttendanceCursor INTO @PersonId , @AttendanceString, @StartDate, @LocationId;
WHILE @@FETCH_STATUS = 0
BEGIN
SET @i = 1;
SET @DayDate = @StartDate;
WHILE (@i < len(@AttendanceString))
BEGIN
SET @MorningValue = SUBSTRING(@AttendanceString,@i,1);
SET @AfternoonValue = SUBSTRING(@AttendanceString,@i+1,1);
BEGIN TRY
INSERT INTO FinalTable
SELECT @DayDate , @LocationId, @PersonId, @MorningValue @AfternoonValue
END TRY
BEGIN CATCH
...
END CATCH
SET @i = @i+2;
SET @DayDate = DATEADD(DD,1,@DayDate );
END
FETCH NEXT FROM @AttendanceCursor INTO @PersonId , @AttendanceString, @StartDate, @LocationId;
END
CLOSE @AttendanceCursor ;
DEALLOCATE @AttendanceCursor ;
Upvotes: 1
Views: 1045
Reputation: 14379
I rewrote this as a set based query instead of a cursor and this whole script, including the generation of 100k test records runs in about 40 seconds on my Azure SQL DB, which is a Serverless gen 5 with only 1 vCore. Work through the script to make sure you understand it.
NB I'm dropping the tables as this is a test rig - not production code:
------------------------------------------------------------------------------------------------
-- Setup START
------------------------------------------------------------------------------------------------
DROP TABLE IF EXISTS dbo.sourceTable
DROP TABLE IF EXISTS dbo.finalTable
GO
CREATE TABLE dbo.sourceTable
(
PersonId INT IDENTITY PRIMARY KEY,
LocationId INT,
StartDate DATE,
AttendanceString VARCHAR(1000)
)
GO
CREATE TABLE dbo.finalTable
(
DayDate DATE,
LocationId INT,
PersonId INT,
MorningValue CHAR(1),
AfternoonValue CHAR(1)
)
GO
-- Generate some test data
SET IDENTITY_INSERT dbo.sourceTable ON
INSERT INTO dbo.sourceTable ( PersonId, LocationId, StartDate, AttendanceString )
VALUES
( 123, 987, '2018-09-01', 'XXXXZZZZ######PPLL' ),
( 234, 678, '2018-10-01', 'PPPPLL######ZZZZXX' ),
( 567, 999, '2018-10-01', 'abcdefghijklmnopqr' )
SET IDENTITY_INSERT dbo.sourceTable OFF
GO
-- Setup END
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
-- Test Data START
------------------------------------------------------------------------------------------------
;WITH cte AS (
SELECT 1 rn, 1 locationId, CAST( '1 Jan 2018' AS DATE ) startDate, REPLACE( NEWID(), '-', '' ) AttendanceString
UNION ALL
SELECT rn + 1, rn % 42, DATEADD( day, 1, startDate ), REPLACE( NEWID(), '-', '' )
FROM cte
WHERE rn < 100
)
INSERT INTO dbo.sourceTable ( LocationId, StartDate, AttendanceString )
SELECT LocationId, StartDate, AttendanceString
FROM cte
ORDER BY 1;
GO 1000
-- Test Data END
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
-- Rewritten query START
------------------------------------------------------------------------------------------------
DROP TABLE IF EXISTS #tmp;
;WITH cte AS (
SELECT 1 n, 1 x, 2 y
UNION ALL
SELECT n + 1, x + 2, y + 2
FROM cte
WHERE n < 20
)
SELECT
personId,
locationId,
DATEADD( day, c.n - 1, startDate ) xdate,
SUBSTRING ( attendanceString, c.x, 1 ) a,
SUBSTRING ( attendanceString, c.y, 1 ) b
INTO #tmp
FROM dbo.sourceTable s
CROSS APPLY cte c
WHERE c.y <= LEN(attendanceString);
select *
from sourceTable
WHERE personId = 999
select *
from #tmp
WHERE personId = 999
select *
from #tmp
WHERE locationId = 999
-- Rewritten query END
------------------------------------------------------------------------------------------------
Altered version of the script for longer attendanceIds here.
Upvotes: 3