TheGeneral
TheGeneral

Reputation: 81493

Function to replace all non alpha-numeric and multiple whitespace characters with a single space

I am trying to write an efficient function to use in a calculated field which has the following characteristics

Example input

A   B@#%$$C &^%D 

Example output

a b c d

A normal regex pattern would match like so

[\W_]+

The following works, however I am not sure if there is a more efficient approach than using 2 loops ( O(n2) complexity at least) with PatIndex and Stuff, charindex and replace

Create Function [dbo].[Clean](@Temp nvarchar(1000))
Returns nvarchar(1000)
AS
Begin
    Declare @Pattern as varchar(50) = '%[^a-z0-9 ]%'

    While PatIndex(@Pattern, @Temp) > 0
        Set @Temp = Stuff(@Temp, PatIndex(@Pattern, @Temp), 1, ' ')

    while charindex('  ',@Temp  ) > 0
       set @Temp = replace(@Temp, '  ', ' ')

    Return LOWER(TRIM(@Temp))
End

Usage

Select dbo.Clean(' A   B@#%$$C &^%D ')

Result

a b c d

Is there potentially a single pass approach I can use, or a sneaky method I am not aware of?

Upvotes: 0

Views: 883

Answers (1)

Zhorov
Zhorov

Reputation: 29943

I'm not able to test the performance, but the following approach (without loops and based on some string manipulations) is an additional option.

Note, that you'll need at least SQL Server 2017 (for the TRANSLATE() call).

-- Input text and patterns
DECLARE @text varchar(1000) = ' A   B@#%$$C &^%D'
DECLARE @alphanumericpattern varchar(36) = 'abcdefghijklmnopqrstuvwxyz0123456789'
DECLARE @notalphanumericpattern varchar(1000) 

-- Trim and lower the input text
SELECT @text = RTRIM(LTRIM(LOWER(@text)))

-- Get not alpha-numeric characters
SELECT @notalphanumericpattern = 
   REPLACE(
      TRANSLATE(@text, @alphanumericpattern, REPLICATE('a', LEN(@alphanumericpattern))),
      'a',
      ''
   )

-- Replace all not alpha-numeric characters with a space
SELECT @text = 
   REPLACE(
      TRANSLATE(@text, @notalphanumericpattern, REPLICATE('$', LEN(@notalphanumericpattern))),
      '$',
      ' '
   )

-- Replace multiple spaces with a single space
SELECT @text =
   REPLACE(
      REPLACE(
         REPLACE(
            @text,
            ' ',
            '<>'
         ),
         '><',
         ''
     ),
     '<>',
     ' '
   )     

Result:

a b c d

Upvotes: 1

Related Questions