Jake Bullet
Jake Bullet

Reputation: 515

String Parsing VARCHAR Column

I'm working with a database where the details of a phone call a user places with a client are saved into a single VARCHAR(MAX) column. For detailed reporting purposes (SSRS), however, all of the various attributes embedded within the VARCHAR column need to be turned into separate columns. To accomplish this parsing, I'm using a cursor in a stored procedure which calls a UDF that handles the extraction of the desired substring.

When querying a relatively small number of rows (no more than a few hundred), the performance of the query is decent. However, when querying a larger number of rows (several thousand), performance is terrible and the query will often time out. So, I'm looking for an approach that scales far better than what I have in place.

The VARCHAR column contains full HTML markup. I've removed irrelevant stuff to illustrate how the data I'm interested in is stored:

<table>
<tr>
<td class="key">Dial Result:</td><td id="DialResult" class="value">No Answer</td></tr>
<tr><td class="key">Client Name:</td><td id="ClientName" class="value">SMITH, BOB</td></tr>
<tr><td class="key">Number Dialed:</td><td id="NumberDialed" class="value">5555555555 [Day]</td></tr>
<tr><td class="key">Dial Count:</td><td id="DialCount" class="value">1</td></tr><tr><td class="key">Contact Made:</td><td id="ContactMade" class="value">No</td></tr>
<tr><td class="key">Campaign Called On:</td><td id="CampaignCalledOn" class="value">TEST CAMPAIGN</td></tr>
<tr><td class="key">Call Outcome:</td><td id="CallOutcome" class="value">No answer</td></tr>
<tr><td class="key">Email Sent:</td><td id="EmailSent" class="value">No</td></tr>
<tr><td class="key">Do Not Call Requested:</td><td id="DoNotCallRequested" class="value">No</td></tr>
<tr><td class="key">Product Purchased:</td><td id="ProductPurchased" class="value">No</td></tr>
<tr><td class="key">Order Number:</td><td id="OrderNumber" class="value">N/A</td></tr><tr><td class="key">Order Dollar Value:</td><td id="PurchaseAmount" class="value">0.00</td></tr>
<tr><td class="key">Purchased SKUs:</td><td id="PurchasedSKUs" class="value">N/A</td></tr>
</table>

Currently, I have a stored procedure which uses a cursor to iterate over rows in the "live" database table. The procedure cursor looks like this:

OPEN MainNoteCursor;
FETCH NEXT FROM MainNoteCursor INTO @SequenceNumber,@ClientId,@ContactNumber,@UserDisplayName,@CreatorId,@DateCol,@NoteText

WHILE (@@FETCH_STATUS <> -1)
    BEGIN
        IF (@@FETCH_STATUS <> -2)
            BEGIN TRY
                DECLARE @NoteDate date = null
                DECLARE @DialResult varchar(255) = null
                DECLARE @ClientName varchar(255) = null
                DECLARE @NumberDialed varchar(255) = null
                DECLARE @DialCount int = null
                DECLARE @ContactMade varchar(3)  = null
                DECLARE @CampaignCalledOn varchar(255)  = null
                DECLARE @CallOutcome varchar(255) = null
                DECLARE @EmailSent varchar(3) = null
                DECLARE @DoNotCallRequested varchar(3) = null
                DECLARE @ProductPurchased varchar(3) = null
                DECLARE @OrderNumber varchar(255) = null
                DECLARE @PurchaseAmount money = null
                DECLARE @PurchasedSKUs varchar(255) = null

                SET @NoteDate = CONVERT(date, @DateCol)

                SET @DialResult = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="DialResult" class="value">'',''</td>'',1)
                SET @ClientName = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="ClientName" class="value">'',''</td>'',1)
                SET @NumberDialed = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="NumberDialed" class="value">'',''</td>'',1)
                SET @DialCount = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="DialCount" class="value">'',''</td>'',1)
                SET @ContactMade = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="ContactMade" class="value">'',''</td>'',1)
                SET @CampaignCalledOn = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="CampaignCalledOn" class="value">'',''</td>'',1)
                SET @CallOutcome = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="CallOutcome" class="value">'',''</td>'',1)
                SET @EmailSent = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="EmailSent" class="value">'',''</td>'',1)
                SET @DoNotCallRequested = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="DoNotCallRequested" class="value">'',''</td>'',1)
                SET @ProductPurchased = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="ProductPurchased" class="value">'',''</td>'',1)
                SET @OrderNumber = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="OrderNumber" class="value">'',''</td>'',1)
                SET @PurchaseAmount = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="PurchaseAmount" class="value">'',''</td>'',1)
                SET @PurchasedSKUs = [dbo].[FN_PARSE_TEXT] (@NoteText,''<TD id="PurchasedSKUs" class="value">'',''</td>'',1) 

                INSERT INTO @Return
                    ([SequenceNumber],[ClientId],[ContactNumber],[UserDisplayName],[CreatorId],
                    [DateCol],[NoteDate],[NoteText],[DialResult],[ClientName],[NumberDialed],[DialCount],
                    [ContactMade],[CampaignCalledOn],[CallOutcome],[EmailSent],[DoNotCallRequested],
                    [ProductPurchased],[OrderNumber],[PurchaseAmount],[PurchasedSKUs])
                VALUES
                    (@SequenceNumber,@ClientId,@ContactNumber,@UserDisplayName,@CreatorId,
                    @DateCol,@NoteDate,@NoteText,@DialResult,@ClientName,@NumberDialed,@DialCount,
                    @ContactMade,@CampaignCalledOn,@CallOutcome,@EmailSent,@DoNotCallRequested,
                    @ProductPurchased,@OrderNumber,@PurchaseAmount,@PurchasedSKUs)  
            END TRY             

            BEGIN CATCH
                --Nothing to do.
            END CATCH       
        FETCH NEXT FROM MainNoteCursor INTO @SequenceNumber,@ClientId,@ContactNumber,@UserDisplayName,@CreatorId,@DateCol,@NoteText
    END

CLOSE MainNoteCursor;
DEALLOCATE MainNoteCursor;

SELECT * FROM @Return

So, you can see that the cursor is parsing the full note body each time, extracting the values between two string delimiters, inserting the values into a table, and then returning the table.

When executed in SSMS, I get a result that looks like this:

enter image description here

The code in the FN_PARSE_TEXT uses CHARINDEX(), LEN(), SUBSTRING(), etc. to read the text between the "Start string" and the "end string". I won't post the full function because there's a lot of housekeeping code that's not relevant, but it can be distilled down to this:

--********************************************************************************
-- CHARINDEX returns the position of the fist character in @StartKey, but we need to
-- start reading after the *last* character in @StartKey.  Re-adjust the position
-- at which we''ll start reading the string.
--********************************************************************************

SET @StartKeyIndex = @StartKeyIndex + @StartKeyLength
SET @ReadLength = @EndKeyIndex - @StartKeyIndex

--********************************************************************************
-- Start / End positions have been determined, so now read out the calculated number
-- of characters and return to the calling code.
--********************************************************************************

SET @ReturnValue = LTRIM(RTRIM(SUBSTRING(@noteText, @StartKeyIndex, @ReadLength)))

So, after all of that background information (I can provide more detail, if needed), I'm looking for a better approach.

I've thought of creating an overnight batch process that does all of the parsing during off-hours and then dumping the data into a flattened table. I would then report of the flattened data instead of trying to parse out the details in real time. This, of course, opens its own can of worms and I'd ideally like to continue to hit the live data. Depending on the date range the users supply when running the query, upwards of 10,000 rows may be returned.

Another alternative I've thought of would be to use a CLR method to handle the looping and parsing - but I don't have a lot of experience there and I'm not sure that would be any more efficient than what I'm doing now.

I have read articles (https://sqlperformance.com/2012/07/t-sql-queries/split-strings - for example) that talk about things like CTE, but I don't really have much experience there, so I'm not sure about how to make the leap. Most of the "string parsing" articles I've read are geared towards scenarios where the delimiter is the same - things like parsing comma-separated strings. But since I have "rolling" delimiters, I'm not sure how to deal with that.

Anyway, if you're still with me, any input would be greatly appreciated.

Upvotes: 1

Views: 1810

Answers (3)

You can use XML to convert your values and after pivot the result

DECLARE @xml xml = '<table>
<tr>
<td class="key">Dial Result:</td><td id="DialResult" class="value">No Answer</td></tr>
<tr><td class="key">Client Name:</td><td id="ClientName" class="value">SMITH, BOB</td></tr>
<tr><td class="key">Number Dialed:</td><td id="NumberDialed" class="value">5555555555 [Day]</td></tr>
<tr><td class="key">Dial Count:</td><td id="DialCount" class="value">1</td></tr><tr><td class="key">Contact Made:</td><td id="ContactMade" class="value">No</td></tr>
<tr><td class="key">Campaign Called On:</td><td id="CampaignCalledOn" class="value">TEST CAMPAIGN</td></tr>
<tr><td class="key">Call Outcome:</td><td id="CallOutcome" class="value">No answer</td></tr>
<tr><td class="key">Email Sent:</td><td id="EmailSent" class="value">No</td></tr>
<tr><td class="key">Do Not Call Requested:</td><td id="DoNotCallRequested" class="value">No</td></tr>
<tr><td class="key">Product Purchased:</td><td id="ProductPurchased" class="value">No</td></tr>
<tr><td class="key">Order Number:</td><td id="OrderNumber" class="value">N/A</td></tr><tr><td class="key">Order Dollar Value:</td><td id="PurchaseAmount" class="value">0.00</td></tr>
<tr><td class="key">Purchased SKUs:</td><td id="PurchasedSKUs" class="value">N/A</td></tr>
</table>'


Select 
 Tbl.Col.value('@id','varchar(max)') Id
,Tbl.Col.value('.','varchar(max)') Value
FRom
 @xml.nodes('/table/tr/td[@class="value"]') Tbl(Col)  


SELECT 
 DialResult
,ClientName
,NumberDialed
,DialCount
,ContactMade
,CampaignCalledOn
,CallOutcome
,EmailSent
,DoNotCallRequested
,ProductPurchased
,OrderNumber
,PurchaseAmount
,PurchasedSKUs
FROM
(
Select 
 Tbl.Col.value('@id','varchar(max)') Id
,Tbl.Col.value('.','varchar(max)') Value
FRom
 @xml.nodes('/table/tr/td[@class="value"]') Tbl(Col) ) T Pivot 
 (Max(Value) FOR  Id In(DialResult
,ClientName
,NumberDialed
,DialCount
,ContactMade
,CampaignCalledOn
,CallOutcome
,EmailSent
,DoNotCallRequested
,ProductPurchased
,OrderNumber
,PurchaseAmount
,PurchasedSKUs)) p;

Result for first select

Id                                                 Value
-------------------------------------------------- --------------------------------------------------
DialResult                                         No Answer
ClientName                                         SMITH, BOB
NumberDialed                                       5555555555 [Day]
DialCount                                          1
ContactMade                                        No
CampaignCalledOn                                   TEST CAMPAIGN
CallOutcome                                        No answer
EmailSent                                          No
DoNotCallRequested                                 No
ProductPurchased                                   No
OrderNumber                                        N/A
PurchaseAmount                                     0.00
PurchasedSKUs                                      N/A

Result for second select

DialResult                                         ClientName                                         NumberDialed                                       DialCount                                          ContactMade                                        CampaignCalledOn                                   CallOutcome                                        EmailSent                                          DoNotCallRequested                                 ProductPurchased                                   OrderNumber                                        PurchaseAmount                                     PurchasedSKUs
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
No Answer                                          SMITH, BOB                                         5555555555 [Day]                                   1                                                  No                                                 TEST CAMPAIGN                                      No answer                                          No                                                 No                                                 No                                                 N/A                                                0.00                                               N/A

(1 row(s) affected)

Upvotes: 0

Xedni
Xedni

Reputation: 4695

Just to offer a tangible example of how you can do this with XML, here's how I'd do it. I only have one sample document to go off of here, but if you had multiple, the same would apply. You'd just get back multiple rows for each primary key.

;with src (Id, NoteText) as
(
    select 1, cast('<table>
<tr>
<td class="key">Dial Result:</td><td id="DialResult" class="value">No Answer</td></tr>
<tr><td class="key">Client Name:</td><td id="ClientName" class="value">SMITH, BOB</td></tr>
<tr><td class="key">Number Dialed:</td><td id="NumberDialed" class="value">5555555555 [Day]</td></tr>
<tr><td class="key">Dial Count:</td><td id="DialCount" class="value">1</td></tr><tr><td class="key">Contact Made:</td><td id="ContactMade" class="value">No</td></tr>
<tr><td class="key">Campaign Called On:</td><td id="CampaignCalledOn" class="value">TEST CAMPAIGN</td></tr>
<tr><td class="key">Call Outcome:</td><td id="CallOutcome" class="value">No answer</td></tr>
<tr><td class="key">Email Sent:</td><td id="EmailSent" class="value">No</td></tr>
<tr><td class="key">Do Not Call Requested:</td><td id="DoNotCallRequested" class="value">No</td></tr>
<tr><td class="key">Product Purchased:</td><td id="ProductPurchased" class="value">No</td></tr>
<tr><td class="key">Order Number:</td><td id="OrderNumber" class="value">N/A</td></tr><tr><td class="key">Order Dollar Value:</td><td id="PurchaseAmount" class="value">0.00</td></tr>
<tr><td class="key">Purchased SKUs:</td><td id="PurchasedSKUs" class="value">N/A</td></tr>
</table>' as xml)
)

select 
    a.*,
    t.c.value('td[1]', 'varchar(max)')
from src a
cross apply a.NoteText.nodes('table/tr') as t(c)

Documentation for XML shredding in SQL can be a little hit or miss, mostly because SQL is implementing xquery which is independent of both SQL and Microsoft. That said, for doing basic stuff, generally the MSDN documentation gets you started pretty well.

Here are a couple good articles

Upvotes: 2

Tab Alleman
Tab Alleman

Reputation: 31785

If you can make modifications to the database, I would do anything I could to avoid doing this parsing at query time. Especially if this is a query that if performed often by users for something like a CRM application.

I would either add all of these frequently-selected columns to my table, or I would create a separate table for them that can be linked to the primary key of this table.

Then, ideally, I would populate these columns at INSERT time. Preferably by having the INSERT go through a stored procedure, but if necessary, by putting an Insert Trigger on the table.

If I can't do it at INSERT time, then I'd run a job that does the parsing and populates the columns as often as practicality requires/allows.

With most CRM applications, SELECT performance is more critical than INSERT performance, because it impacts multiple rows, while INSERTS usually happen one at a time. So I would try to shift the performance burden there if at all possible.

DISCLAIMER: I've never played with the XML functions that Xedni references in his comments. So I cannot say what the performance difference would be between my suggestion and his.

Upvotes: 1

Related Questions