Henrov

Reputation: 1620

Find all nvarchar fields in a database and do a replace(<field>, CHAR(10), '') on them

I get data through an XML file, which I read with a third-party component (ZappySys; I am in no way affiliated with them, but maybe somebody knows their product). The data in the XML looks like this:

<customer>
"Johnny"
</customer>

What I end up with in the table (customers) is an nvarchar column (surname) with the following content:

CHAR(10)JohnnyCHAR(10)

This happens in every nvarchar field that gets read from the XML. The component extracts exactly what it reads, line feeds included, but those characters break quite a lot of statements.

select * from customers where surname = 'Johnny'

yields no results, while

select * from customers where surname like '%Johnny%'

or

select * from customers where replace(surname,char(10),'') = 'Johnny'

do.

Not very pretty.
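The behaviour described above can be reproduced in isolation. The following is a minimal sketch, assuming SQL Server; @demo is a hypothetical table variable that stands in for the customers table. It also shows how LEN and UNICODE can expose the hidden character.

```sql
-- Hypothetical demo table; stands in for the customers table from the question.
DECLARE @demo TABLE (surname nvarchar(100));
INSERT @demo (surname) VALUES (NCHAR(10) + N'Johnny' + NCHAR(10));

-- The equality test is expected to find nothing because of the surrounding line feeds.
SELECT COUNT(*) AS exact_matches
FROM   @demo
WHERE  surname = N'Johnny';

-- Stripping the line feeds first lets the comparison succeed.
SELECT COUNT(*) AS cleaned_matches
FROM   @demo
WHERE  REPLACE(surname, NCHAR(10), N'') = N'Johnny';

-- LEN includes the two extra characters; UNICODE returns the code point of the first one.
SELECT LEN(surname)              AS len_with_lf,
       UNICODE(LEFT(surname, 1)) AS first_code_point
FROM   @demo;
```

Running the last query against a real column is a quick way to confirm that the stray character really is CHAR(10) and not, say, CHAR(13) or a tab.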

One way to solve this is to use views with a sh*tload of replace statements. But wouldn't it be great if I could run a procedure that wipes these CHAR(10) from every nvarchar field?

It must be possible to write an update statement that finds all the nvarchar fields and does a replace(<field>, CHAR(10), '') on them?

To be clear: I do know how UPDATE statements work. I'm looking for a way to avoid writing an update statement for each (n)varchar field in my db.

Update:

I came up with this code after a suggestion from @Matt (see the answer marked as the solution):

declare @temptable table (id int identity(1,1), [sql] nvarchar(4000))

-- build one UPDATE statement per nvarchar column of the user tables in the schema
insert into @temptable([sql])
SELECT 'UPDATE '+quotename(i.TABLE_SCHEMA)+'.'+quotename(i.TABLE_NAME)
      +' SET '+quotename(i.COLUMN_NAME)+' = REPLACE('+quotename(i.COLUMN_NAME)+', CHAR(10),'''')'
FROM INFORMATION_SCHEMA.COLUMNS i
inner join sys.tables t
        on i.TABLE_NAME = t.name
       and i.TABLE_SCHEMA = schema_name(t.schema_id) -- match on schema too, not only on table name
WHERE i.DATA_TYPE = 'NVARCHAR'
and t.type = 'U'
and i.TABLE_SCHEMA = 'myschema'

declare @i as int = 1
declare @sql as nvarchar(max)
declare @max as int = (select max(id) from @temptable)

-- run the generated statements one at a time
while @i <= @max
BEGIN
    set @sql = (select [sql] from @temptable where id = @i)
    exec sp_executesql @sql
    --print cast(@i as varchar(5)) + '/'+cast(@max as varchar(5)) + ' done, ' +cast(@max-@i as varchar(5)) + ' to go...'
    set @sql = ''
    set @i = @i+1
END

Upvotes: 0

Views: 281

Answers (3)

Alan Burstein

Reputation: 7928

First you'll need a good N-Grams function such as the one covered here. The version I am including below is the NVARCHAR(4000) version (Kudos to Larnu for his contribution.) I used NGramsN4K to build a NVARCHAR(4000) PatReplace function. I use different schemas for my functions but dbo will work just fine.

Note that this:

SELECT pr.NewString 
FROM   samd.patReplaceN4K(N'ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ','[^0-9a-zA-Z]','') AS pr;

Returns: ABC123XYZ

Every character that matches the pattern [^0-9a-zA-Z] (that is, anything other than a digit or an ASCII letter) has been removed. Now let's use the function against a table with records containing bad characters, remove them, then join them to a table with good values. Note my comments.

-- Sample data
DECLARE @Customers  TABLE (CustomerId INT IDENTITY, Surname NVARCHAR(100));
DECLARE @GoodValues TABLE (Surname NVARCHAR(100));

INSERT @Customers  (Surname) VALUES (CHAR(10)+'Johnny'+CHAR(10)),('Smith'),('Jones'+CHAR(160));
INSERT @GoodValues (Surname) VALUES ('Johnny'),('Smith'),('Jones'),('James');

-- Fail:
SELECT c.CustomerId, g.Surname
FROM   @Customers  AS c
JOIN   @GoodValues AS g 
  ON   c.Surname = g.Surname;

-- Success:
SELECT c.CustomerId, g.Surname
FROM        @Customers  AS c
CROSS APPLY samd.patreplaceN4K(c.Surname,'[^0-9a-zA-Z ]','') AS pr
JOIN        @GoodValues AS g 
  ON        pr.newString = g.Surname;
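For comparison, when the unwanted characters are known in advance (here the line feed and the CHAR(160) from the sample data), plain nested REPLACE calls may be enough. The following is a sketch of that simpler alternative, not the pattern-based function described in this answer; it redeclares the same sample tables so it can be run on its own.

```sql
-- Same sample data as above, redeclared so the batch is self-contained.
DECLARE @Customers  TABLE (CustomerId INT IDENTITY, Surname NVARCHAR(100));
DECLARE @GoodValues TABLE (Surname NVARCHAR(100));

INSERT @Customers  (Surname) VALUES (CHAR(10)+'Johnny'+CHAR(10)),('Smith'),('Jones'+CHAR(160));
INSERT @GoodValues (Surname) VALUES ('Johnny'),('Smith'),('Jones'),('James');

-- Strip only the two known characters before comparing.
SELECT c.CustomerId, g.Surname
FROM   @Customers  AS c
JOIN   @GoodValues AS g
  ON   REPLACE(REPLACE(c.Surname, NCHAR(10), N''), NCHAR(160), N'') = g.Surname;
```

This only covers the characters listed explicitly; the pattern-based function is the more general tool when the set of bad characters is not known up front.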

samd.NGramsN4K

CREATE FUNCTION samd.NGramsN4K
(
  @string NVARCHAR(4000), -- Input string
  @N      INT             -- requested token size
)
/*****************************************************************************************
[Purpose]:
 A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens 
 based on an input string (@string). Accepts strings up to 4000 NVARCHAR characters long.
 For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram. 

[Author]:
  Alan Burstein

[Compatibility]:
 SQL Server 2008+, Azure SQL Database

[Syntax]:
--===== Autonomous
 SELECT ng.position, ng.token
 FROM   samd.NGramsN4K(@string,@N) AS ng;

--===== Against a table using APPLY
 SELECT      s.SomeID, ng.position, ng.token
 FROM        dbo.SomeTable                  AS s
 CROSS APPLY samd.NGramsN4K(s.SomeValue,@N) AS ng;

[Parameters]:
 @string  = The input string to split into tokens.
 @N       = The size of each token returned.

[Returns]:
 Position = bigint; the position of the token in the input string
 token    = NVARCHAR(4000); a @N-sized character-level N-Gram token

[Dependencies]:
 1. core.rangeAB (iTVF)

[Developer Notes]:
 1. NGramsN4K is not case sensitive

 2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
    creates a parallel execution plan. One way to get a parallel query plan (if the 
    optimizer does not choose one) is to use make_parallel by Adam Machanic which can be 
    found here:
 sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx

 3. When @N is less than 1 or greater than the datalength of the input string then no 
    tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
    This is a debatable topic but the thinking behind this decision is that: because you
    can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you 
    can't turn anything into NULL-grams, no rows should be returned.

    For people who would prefer that a NULL input forces the function to return a single
    NULL output you could add this code to the end of the function:

    UNION ALL 
    SELECT 1, NULL
    WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);

 4. NGramsN4K is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx

[Examples]:
--===== 1. Turn the string, 'ɰɰXɰɰ' into unigrams, bigrams and trigrams
 DECLARE @string NVARCHAR(4000) = N'ɰɰXɰɰ';
 BEGIN
   SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,1) AS ng; -- unigrams (@N=1)
   SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,2) AS ng; -- bigrams  (@N=2)
   SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,3) AS ng; -- trigrams (@N=3)
   SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,4) AS ng; -- 4-grams  (@N=4)
 END

--===== 2. Scenarios where the function would not return rows
 SELECT ng.Position, ng.Token FROM samd.NGramsN4K('abcd',5)   AS ng; -- 5-grams  (@N=5)
 SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', 0)    AS ng;
 SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', NULL) AS ng;

 This will fail:
 --SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x',-1)    AS ng;

--===== 3. How many times the substring "ƒƓ" appears in each record
 BEGIN
   DECLARE @table TABLE(stringID int identity primary key, string NVARCHAR(100));
   INSERT @table(string)
   VALUES (N'ƒƓ123ƒƓ'),(N'123ƒƓƒƓƒƓ'),(N'!ƒƓ!ƒƓ!'),(N'ƒƓ-ƒƓ-ƒƓ-ƒƓ-ƒƓ');

   SELECT t.String, Occurances = COUNT(*) 
   FROM @table                            AS t
   CROSS APPLY samd.NGramsN4K(t.string,2) AS ng
   WHERE       ng.token = N'ƒƓ'
   GROUP BY    t.string;
 END;
-----------------------------------------------------------------------------------------
[Revision History]:
 Rev 00 - 20170324 - Initial Development - Alan Burstein
 Rev 01 - 20180829 - Changed TOP logic and startup-predicate logic in the WHERE clause
                   - Alan Burstein
 Rev 02 - 20191129 - Redesigned to leverage rangeAB - Alan Burstein
 Rev 03 - 20200416 - changed the cast from NCHAR(4000) to NVARCHAR(4000)
                   - Removed: WHERE @N BETWEEN 1 AND s.Ln; this must now be handled
                     manually moving forward. - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  Position = r.RN,                                              -- Token Position
  Token    = CAST(SUBSTRING(@string,r.RN,@N) AS NVARCHAR(4000)) -- @N-Sized Token
FROM        (VALUES(DATALENGTH(ISNULL(NULLIF(@string,N''),N'X'))/2)) AS s(Ln)
CROSS APPLY core.rangeAB(1,s.Ln-(ISNULL(@N,1)-1),1,1)                AS r
GO

samd.patReplaceN4K

CREATE FUNCTION samd.patReplaceN4K
(
  @string  NVARCHAR(4000), -- Input String
  @pattern NVARCHAR(50),   -- Pattern to match/replace
  @replace NVARCHAR(20)    -- What to replace the matched pattern with
)
/*****************************************************************************************
[Purpose]:
 Given a string (@string), a pattern (@pattern), and a replacement character (@replace)
 patReplaceN4K will replace any character in @string that matches the @Pattern parameter 
 with the character, @replace.

[Author]:
 Alan Burstein

[Compatibility]:
  SQL Server 2008+

[Syntax]:
--===== Basic Syntax Example
 SELECT pr.NewString
 FROM   samd.patReplaceN4K(@String,@Pattern,@Replace) AS pr;

[Parameters]:
 @string  = NVARCHAR(4000); The input string to manipulate
 @pattern = NVARCHAR(50);   The pattern to match/replace
 @replace = NVARCHAR(20);   What to replace the matched pattern with

[Returns]:
 Inline Table Valued Function returns:
 NewString = NVARCHAR(4000); The new string with all instances of @Pattern replaced with
             The value of @Replace.

[Dependencies]:
 samd.NGramsN4K (iTVF)

[Developer Notes]:
 1. @Pattern IS case sensitive but can be easily modified to make it case insensitive
 2. There is no need to include the "%" before and/or after your pattern since we 
    are evaluating each character individually
 3. Certain special characters, such as "$" and "%" need to be escaped with a "/"
    like so: [/$/%]
 4. As is the case with functions which leverage samd.ngrams or samd.ngramsN4k, 
    samd.patReplaceN4K is almost always dramatically faster with a parallel execution
    plan. One way to get a parallel query plan (if the optimizer does not choose one) is
    to use make_parallel by Adam Machanic found here:
  sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx

    On my PC (8 logical CPU, 64GB RAM, SQL 2019) samd.patReplaceN4K is about 4X
    faster when executed using all 8 of my logical CPUs. 
 5. samd.patReplaceN4K is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx

[Examples]:
--===== 1. Remove non alphanumeric characters
 SELECT pr.NewString 
 FROM   samd.patReplaceN4K(N'ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ','[^0-9a-zA-Z]','') AS pr;

--===== 2. Replace numeric characters with a "*"
 SELECT pr.NewString
 FROM  samd.patReplaceN4K('My phone number is 555-2211','[0-9]','*') AS pr;

--===== 3. Using against a table
 DECLARE @table TABLE(OldString varchar(60));
 INSERT  @table VALUES ('Call me at 555-222-6666'), ('phone number: (312)555-2323'),
                       ('He can be reached at 444.665.4466 on Monday.');

 SELECT      t.OldString, pr.NewString
 FROM        @table                                     AS t
 CROSS APPLY samd.patReplaceN4K(t.oldstring,'[0-9]','*') AS pr;

[Revision History]:
-----------------------------------------------------------------------------------------
Rev 01  - 20200422 - Created - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString = 
(
  SELECT CASE WHEN @string = a.Blank THEN a.Blank ELSE
           CASE WHEN PATINDEX(@pattern,a.Token)&0x01=0 THEN ng.token ELSE @replace END END
  FROM        samd.NGramsN4K(@string,1) AS ng
  CROSS APPLY (VALUES(CAST('' AS NVARCHAR(4000)),
                      ng.token COLLATE Latin1_General_BIN)) AS a(Blank,Token)
  ORDER BY ng.position
  FOR XML PATH(''),TYPE
).value('text()[1]', 'NVARCHAR(4000)');
GO

Upvotes: 0

Matt

Reputation: 835

Sure, you could run an update on that surname field as part of your import. Something like this would work for you:

UPDATE customers
SET surname = replace(surname,char(10),'')

Or you could use some dynamic SQL like this to generate the update statements; you could quickly change it so that it also executes them:

SELECT 'UPDATE '+QUOTENAME(TABLE_CATALOG)+'.'+QUOTENAME(TABLE_SCHEMA)+'.'+QUOTENAME(TABLE_NAME)+' SET 
'+QUOTENAME(COLUMN_NAME)+' = REPLACE('+QUOTENAME(COLUMN_NAME)+', CHAR(10),'''')'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE = 'NVARCHAR'
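One way to go from generating the statements to executing them is to concatenate them into a single batch. This is a rough sketch, assuming SQL Server 2017 or later (for STRING_AGG); @sep and @batch are hypothetical variable names, and the join to INFORMATION_SCHEMA.TABLES is an added assumption to keep view columns out of the list.

```sql
-- Sketch: build every UPDATE statement, join them into one batch, review, then run.
DECLARE @sep   nvarchar(10) = N';' + NCHAR(13) + NCHAR(10); -- statement separator
DECLARE @batch nvarchar(max);

SELECT @batch = STRING_AGG(
           CAST(N'UPDATE ' + QUOTENAME(c.TABLE_SCHEMA) + N'.' + QUOTENAME(c.TABLE_NAME)
              + N' SET ' + QUOTENAME(c.COLUMN_NAME)
              + N' = REPLACE(' + QUOTENAME(c.COLUMN_NAME) + N', CHAR(10), N'''')'
              AS nvarchar(max)),   -- cast so the aggregate is not capped at 4000 characters
           @sep)
FROM   INFORMATION_SCHEMA.COLUMNS AS c
JOIN   INFORMATION_SCHEMA.TABLES  AS t
  ON   t.TABLE_SCHEMA = c.TABLE_SCHEMA
 AND   t.TABLE_NAME   = c.TABLE_NAME
WHERE  t.TABLE_TYPE = 'BASE TABLE'   -- skip views
  AND  c.DATA_TYPE  = 'nvarchar';

SELECT @batch AS generated_batch;    -- review the statements first
-- EXEC sys.sp_executesql @batch;    -- uncomment to run them
```

The batch is returned with SELECT rather than PRINT because PRINT truncates long strings. On a large database it is probably worth running this inside a transaction or against a copy first.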

Upvotes: 1

KeithL

Reputation: 5594

This should give you a list of columns to build a cursor around:

select COLUMN_NAME
from INFORMATION_SCHEMA.COLUMNS
where DATA_TYPE in ('varchar','nvarchar')
    and TABLE_NAME = 'your table name'

This one works much smoother.
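A cursor built around that column list might look roughly like the following. This is a sketch under stated assumptions: 'dbo' and 'customers' are placeholder names to be replaced with your own schema and table, and the WHERE clause in the generated statement is an addition so that only rows containing a line feed are rewritten.

```sql
-- Sketch: loop over the (n)varchar columns of one table with a cursor.
-- 'dbo' and 'customers' are assumed names; substitute your own schema and table.
DECLARE @schema sysname = N'dbo',
        @table  sysname = N'customers',
        @column sysname,
        @sql    nvarchar(max);

DECLARE col_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT COLUMN_NAME
    FROM   INFORMATION_SCHEMA.COLUMNS
    WHERE  DATA_TYPE IN ('varchar', 'nvarchar')
      AND  TABLE_SCHEMA = @schema
      AND  TABLE_NAME   = @table;

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @column;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build and run one UPDATE for the current column.
    SET @sql = N'UPDATE ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)
             + N' SET '   + QUOTENAME(@column)
             + N' = REPLACE(' + QUOTENAME(@column) + N', CHAR(10), '''')'
             + N' WHERE ' + QUOTENAME(@column) + N' LIKE ''%'' + CHAR(10) + ''%'';';

    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM col_cursor INTO @column;
END;

CLOSE col_cursor;
DEALLOCATE col_cursor;
```

QUOTENAME is used on every identifier so that table or column names containing spaces or reserved words do not break the generated statement.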

Upvotes: 0
