blue piranha

Reputation: 3874

Efficiently replacing many characters from a string

I would like to know the most efficient way of removing any occurrence of characters like , ; / " from a varchar column.

I have a function like this but it is incredibly slow. The table has about 20 million records.

CREATE FUNCTION [dbo].[Udf_getcleanedstring] (@s VARCHAR(255))
returns VARCHAR(255)
AS
  BEGIN
      DECLARE @o VARCHAR(255)

      SET @o = Replace(@s, '/', '')
      SET @o = Replace(@o, '-', '')
      SET @o = Replace(@o, ';', '')
      SET @o = Replace(@o, '"', '')

      RETURN @o
  END 

Upvotes: 3

Views: 5075

Answers (4)

Martin Smith

Reputation: 453628

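Whichever method you use, it is probably worth adding a filter such as

WHERE YourCol LIKE '%[-/;"]%'

unless you suspect that a very large proportion of rows will in fact contain at least one of the characters that need to be stripped. The hyphen goes first inside the brackets so that it is read as a literal character. Written as [/-;"] it would define a range from / to ;, which matches digits and colons but not the hyphen itself.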
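
A quick way to check what a bracket expression actually matches is to run it against a few literal values. The sketch below uses made-up sample strings; only the four target characters come from the question. The hyphen is placed first inside the brackets so that LIKE treats it as a literal rather than as a range operator.

```sql
-- Hypothetical sample values, only to show which rows the filter would keep.
SELECT v.s,
       CASE WHEN v.s LIKE '%[-/;"]%' THEN 'needs cleaning' ELSE 'skip' END AS verdict
FROM (VALUES ('abc'),
             ('a-b'),
             ('a/b'),
             ('12:30'),
             ('x;y'),
             ('say "hi"')) AS v(s);
-- 'abc' and '12:30' contain none of the target characters and should be skipped;
-- the other four each contain one and should be flagged.
```

If most rows come back flagged on a sample of the real data, the filter saves little and can be dropped.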

As you are using this in an UPDATE statement, simply adding the WITH SCHEMABINDING attribute to the function can massively improve things. It allows the UPDATE to proceed row by row rather than needing to cache the entire operation in a spool first for Halloween Protection.
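
As a minimal sketch, this is the function from the question with WITH SCHEMABINDING added and nothing else changed. The OBJECTPROPERTYEX check at the end is an extra step, not something from the original answer; it is one way to see whether SQL Server now regards the function as free of data access.

```sql
ALTER FUNCTION [dbo].[Udf_getcleanedstring] (@s VARCHAR(255))
RETURNS VARCHAR(255)
WITH SCHEMABINDING  -- the only change from the original definition
AS
  BEGIN
      DECLARE @o VARCHAR(255)

      SET @o = Replace(@s, '/', '')
      SET @o = Replace(@o, '-', '')
      SET @o = Replace(@o, ';', '')
      SET @o = Replace(@o, '"', '')

      RETURN @o
  END
GO

-- Optional check (an addition): both values are expected to be 0
-- for a schema-bound function that touches no tables.
SELECT OBJECTPROPERTYEX(OBJECT_ID('dbo.Udf_getcleanedstring'), 'UserDataAccess')   AS UserDataAccess,
       OBJECTPROPERTYEX(OBJECT_ID('dbo.Udf_getcleanedstring'), 'SystemDataAccess') AS SystemDataAccess;
```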


Nested REPLACE calls in T-SQL are slow anyway, though, as they involve multiple passes through the strings.

You could knock up a CLR function as below (if you haven't worked with these before then they are very easy to deploy from an SSDT project as long as CLR execution is permitted on the server). The UPDATE plan for this too does not contain a spool.

The regular expression uses (?:) to denote a non-capturing group, with the various characters of interest separated by the alternation character | as /|-|;|\" (the " needs to be escaped in the C# string literal, so it is preceded by a backslash).

using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class UserDefinedFunctions
{
    private static readonly Regex regexStrip = 
                        new Regex("(?:/|-|;|\")", RegexOptions.Compiled);

    [SqlFunction]
    public static SqlString StripChars(SqlString Input)
    {
        return Input.IsNull ?  null : regexStrip.Replace((string)Input, "");        
    }
}
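
If you are not deploying from SSDT, the function can also be registered by hand. The following is only a sketch: the assembly name StripCharsAssembly, the DLL path, and the table and column names are placeholders, and the NVARCHAR(4000) signature is an assumption about how SqlString should be mapped. On SQL Server 2017 and later, the 'clr strict security' option may additionally require the assembly to be signed or trusted.

```sql
-- Enable CLR execution if it is not already on (needs suitable server permissions).
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Placeholder assembly name and path: point this at the compiled DLL.
CREATE ASSEMBLY StripCharsAssembly
FROM 'C:\path\to\StripChars.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Expose the static C# method as a T-SQL scalar function.
CREATE FUNCTION dbo.StripChars (@Input NVARCHAR(4000))
RETURNS NVARCHAR(4000)
AS EXTERNAL NAME StripCharsAssembly.UserDefinedFunctions.StripChars;
GO

-- Usage against a placeholder table and column.
UPDATE dbo.YourTable
SET    YourCol = dbo.StripChars(YourCol)
WHERE  YourCol LIKE '%[-/;"]%';
```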

Upvotes: 4

itzik Paz

Reputation: 433

I want to show the huge performance difference between two types of user-defined function:

  1. User TABLE function
  2. User SCALAR function

See the test example:

use AdventureWorks2012
go

-- create table for the test
create table dbo.FindString (ColA int identity(1,1) not null primary key,ColB varchar(max) );

declare @text varchar(max) =  'A web server can handle a Hypertext Transfer Protocol request either by reading 
a file from its file ; system based on the URL <> path or by handling the request using logic that is specific 
to the type of resource. In the case that special logic is invoked the query string will be available to that logic 
for use in its processing, along with the path component of the URL.';

-- populate the table: "go 1000000" repeats the insert batch 1,000,000 times
insert into dbo.FindString(ColB)
select @text 
go 1000000

-- use one of the scalar functions posted in the answers in this thread
alter function [dbo].[udf_getCleanedString]
( 
@s varchar(max)
)
returns  varchar(max)
as
begin
return replace(replace(replace(replace(@s,'/',''),'-',''),';',''),'"','')
end
go
--
-- create a new inline table function from the scalar function above
create function [dbo].[utf_getCleanedString]
( 
@s varchar(max) -- varchar(max) to match ColB and the scalar function; varchar(255) would truncate the test text
)
returns  table 
as return
(
select  replace(replace(replace(replace(@s,'/',''),'-',''),';',''),'"','') as String
)
go

--
-- clear the buffer cache
DBCC DROPCLEANBUFFERS ;
go
-- update process using USER TABLE FUNCTION
update Dest with(rowlock) set
dest.ColB  = D.String
from dbo.FindString dest
cross apply utf_getCleanedString(dest.ColB) as D
go

DBCC DROPCLEANBUFFERS ;
go
-- update process using USER SCALAR FUNCTION
update Dest with(rowlock) set
dest.ColB  =  dbo.udf_getCleanedString(dest.ColB) 
from dbo.FindString dest
go

And these are the execution plans. As you can see, the UTF (table function) is much better than the USF (scalar function). The two do the same thing, replacing characters in a string, but one returns a scalar and the other returns a table.

Another important measurement to look at is the output you get with SET STATISTICS IO ON.
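
To collect those figures yourself, wrap each UPDATE from the test in the statistics switches, roughly as below. SET STATISTICS TIME is an addition here, not part of the original test; it reports CPU and elapsed time alongside the read counts. The results appear in the Messages tab in SSMS.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;   -- addition: also report CPU and elapsed time
GO

-- run one variant at a time, e.g. the table-function update from the test above
update dest set
dest.ColB = D.String
from dbo.FindString dest
cross apply dbo.utf_getCleanedString(dest.ColB) as D;
GO

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```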

Upvotes: 2

redxdotrocks

Reputation: 44

Here is a similar question that was asked previously. I like the approach mentioned there.

How to Replace Multiple Characters in SQL?

declare @badStrings table (item varchar(50))

INSERT INTO @badStrings(item)
SELECT '>' UNION ALL
SELECT '<' UNION ALL
SELECT '(' UNION ALL
SELECT ')' UNION ALL
SELECT '!' UNION ALL
SELECT '?' UNION ALL
SELECT '@'

declare @testString varchar(100), @newString varchar(100)

set @testString = 'Juliet ro><0zs my s0x()rz!!?!one!@!@!@!'
set @newString = @testString

SELECT @newString = Replace(@newString, item, '') FROM @badStrings

select @newString -- returns 'Juliet ro0zs my s0xrzone'

Upvotes: 0

Code Different

Reputation: 93181

How about nesting them together in a single call:

 create function [dbo].[udf_getCleanedString]
 ( 
    @s varchar(255)
 )
 returns varchar(255)
 as
 begin
   return replace(replace(replace(replace(@s,'/',''),'-',''),';',''),'"','')
 end

Or you may want to do a one-time UPDATE on the table itself, without going through a function. Scalar functions are pretty slow.
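
A sketch of such a direct update, with the REPLACE calls inlined so that no scalar function is invoked per row. dbo.YourTable and YourCol are placeholder names, and the WHERE filter is an optional addition that skips rows containing none of the four characters.

```sql
-- Placeholder table and column names.
UPDATE dbo.YourTable
SET    YourCol = REPLACE(REPLACE(REPLACE(REPLACE(YourCol, '/', ''), '-', ''), ';', ''), '"', '')
WHERE  YourCol LIKE '%[-/;"]%';  -- hyphen first so it is a literal, not a range
```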

Upvotes: 0
