Reputation: 1620
I get data through an XML file. I use a third-party component for this (ZappySys; I am in no way affiliated with them, but maybe somebody knows their product). The data in the XML looks like this:
<customer>
"Johnny"
</customer>
What I end up with in the table (customers) is an nvarchar (surname) with the following content:
CHAR(10)JohnnyCHAR(10)
This is in every nvarchar field that gets read from the XML. The component does extract exactly what it reads, but those characters break quite a lot of statements.
select * from customers where surname = 'Johnny'
yields no results.
select * from customers where surname like '%Johnny%'
or
select * from customers where replace(surname,char(10),'') = 'Johnny'
do.
Not very pretty.
One way to solve this is to use views with a sh*tload of replace statements. But wouldn't it be great if I could run a procedure that wipes these CHAR(10) from every nvarchar field?
It must be possible to write an update statement that finds all the nvarchar fields and does a replace(<field>, CHAR(10), '') on them?
To be clear: I do know how UPDATE statements work. I'm looking for a way to avoid writing an update statement for each (n)varchar field in my db.
update:
came up with this code after a suggestion from @matt (see answer marked as solution)
-- Collect one UPDATE statement per nvarchar column in the target schema
declare @temptable table (id int identity(1,1), sql nvarchar(4000))
insert into @temptable(sql)
SELECT 'UPDATE '+quotename(i.TABLE_SCHEMA)+'.'+quotename(i.TABLE_NAME)
 +' SET '+quotename(i.COLUMN_NAME)+' = REPLACE('+quotename(i.COLUMN_NAME)+', CHAR(10),'''')'
FROM INFORMATION_SCHEMA.COLUMNS i
inner join sys.tables t on i.TABLE_NAME = t.name
 and i.TABLE_SCHEMA = schema_name(t.schema_id) -- match on schema too, not just the table name
WHERE i.DATA_TYPE = 'nvarchar' -- lowercase, so it also matches under a case-sensitive collation
and t.type = 'U' -- user tables only; views are skipped
and i.TABLE_SCHEMA = 'myschema'
-- Run the collected statements one at a time
declare @i as int = 1
declare @sql as nvarchar(max)
declare @max as int = (select max(id) from @temptable)
while @i <= @max
BEGIN
 set @sql = (select [sql] from @temptable where id = @i)
 exec sp_executesql @sql
 --print cast(@i as varchar(5)) + '/'+cast(@max as varchar(5)) + ' done, ' +cast(@max-@i as varchar(5)) + ' to go...'
 set @sql = ''
 set @i = @i+1
END
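Once the loop has run, a quick check can confirm that nothing was missed. The following is only a sketch, not part of the original solution: it assumes SQL Server 2017 or later (for STRING_AGG) and reuses the 'myschema' placeholder from the code above. It builds one count query per nvarchar column and runs them as a single UNION ALL batch.

```sql
-- Sketch: count the rows that still contain CHAR(10), per nvarchar column.
-- Assumes SQL Server 2017+ (STRING_AGG); 'myschema' is the placeholder used above.
DECLARE @check nvarchar(max);

SELECT @check = STRING_AGG(
    CAST(N'SELECT N''' + REPLACE(c.TABLE_NAME, N'''', N'''''') + N''' AS table_name, N'''
         + REPLACE(c.COLUMN_NAME, N'''', N'''''') + N''' AS column_name, COUNT(*) AS rows_left FROM '
         + QUOTENAME(c.TABLE_SCHEMA) + N'.' + QUOTENAME(c.TABLE_NAME)
         + N' WHERE ' + QUOTENAME(c.COLUMN_NAME) + N' LIKE ''%'' + CHAR(10) + ''%''' AS nvarchar(max)),
    N' UNION ALL ')
FROM INFORMATION_SCHEMA.COLUMNS AS c
JOIN INFORMATION_SCHEMA.TABLES AS t
  ON t.TABLE_SCHEMA = c.TABLE_SCHEMA
 AND t.TABLE_NAME   = c.TABLE_NAME
WHERE c.DATA_TYPE = 'nvarchar'
  AND t.TABLE_TYPE = 'BASE TABLE'      -- skip views
  AND c.TABLE_SCHEMA = 'myschema';

-- @check stays NULL when the schema has no nvarchar columns
IF @check IS NOT NULL
    EXEC sys.sp_executesql @check;
```

Every row in the result should report rows_left = 0 if the cleanup reached all columns.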
Upvotes: 0
Views: 281
Reputation: 7928
First you'll need a good N-Grams function such as the one covered here. The version I am including below is the NVARCHAR(4000) version (kudos to Larnu for his contribution). I used NGramsN4K to build an NVARCHAR(4000) PatReplace function. I use different schemas for my functions but dbo will work just fine.
Note that this:
SELECT pr.NewString
FROM samd.patReplaceN4K('ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ','[^0-9a-zA-Z]','') AS pr;
Returns: ABC123XYZ
All characters that match this pattern (that is, anything other than a digit or a letter): [^0-9a-zA-Z]
have been removed. Now let's use the function against a table with records containing bad characters, remove them, then join them to a table with good values. Note my comments.
-- Sample data
DECLARE @Customers TABLE (CustomerId INT IDENTITY, Surname NVARCHAR(100));
DECLARE @GoodValues TABLE (Surname NVARCHAR(100));
INSERT @Customers (Surname) VALUES (CHAR(10)+'Johnny'+CHAR(10)),('Smith'),('Jones'+CHAR(160));
INSERT @GoodValues (Surname) VALUES('Johnny'),('Smith'),('Jones'),('James');
-- Fail:
SELECT c.CustomerId, g.Surname
FROM @Customers AS c
JOIN @GoodValues AS g
ON c.Surname = g.Surname;
-- Success:
SELECT c.CustomerId, g.Surname
FROM @Customers AS c
CROSS APPLY samd.patreplaceN4K(c.Surname,'[^0-9a-zA-Z ]','') AS pr
JOIN @GoodValues AS g
ON pr.newString = g.Surname;
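If the unwanted characters are a short, known list, the same join can be done without the helper functions defined below. This is a swapped-in technique (plain nested REPLACE), not the pattern-based approach of this answer, and the sample rows are illustrative assumptions. It strips only carriage return, line feed and tab.

```sql
-- Sketch: dependency-free cleanup for a fixed set of control characters.
DECLARE @Customers  TABLE (CustomerId INT IDENTITY, Surname NVARCHAR(100));
DECLARE @GoodValues TABLE (Surname NVARCHAR(100));

INSERT @Customers (Surname)
VALUES (NCHAR(10) + N'Johnny' + NCHAR(10)),     -- line feeds on both sides
       (N'Smith'),                              -- already clean
       (N'Jones' + NCHAR(13) + NCHAR(10));      -- trailing CR/LF
INSERT @GoodValues (Surname)
VALUES (N'Johnny'), (N'Smith'), (N'Jones'), (N'James');

SELECT c.CustomerId, g.Surname
FROM @Customers AS c
CROSS APPLY (VALUES (REPLACE(REPLACE(REPLACE(c.Surname,
                     NCHAR(13), N''),           -- carriage return
                     NCHAR(10), N''),           -- line feed
                     NCHAR(9),  N''))           -- tab
            ) AS clean(Surname)
JOIN @GoodValues AS g
  ON clean.Surname = g.Surname;
```

With this sample data the join should match Johnny, Smith and Jones. The pattern-based function remains the better fit when the set of bad characters is not known in advance.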
samd.NGramsN4K
CREATE FUNCTION samd.NGramsN4K
(
@string NVARCHAR(4000), -- Input string
@N INT -- requested token size
)
/*****************************************************************************************
[Purpose]:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 NVARCHAR characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+, Azure SQL Database
[Syntax]:
--===== Autonomous
SELECT ng.position, ng.token
FROM samd.NGramsN4K(@string,@N) AS ng;
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable AS s
CROSS APPLY samd.NGramsN4K(s.SomeValue,@N) AS ng;
[Parameters]:
@string = The input string to split into tokens.
@N = The size of each token returned.
[Returns]:
Position = bigint; the position of the token in the input string
token = NVARCHAR(4000); a @N-sized character-level N-Gram token
[Dependencies]:
1. core.rangeAB (iTVF)
[Developer Notes]:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Turn the string, 'ɰɰXɰɰ' into unigrams, bigrams and trigrams
DECLARE @string NVARCHAR(4000) = N'ɰɰXɰɰ';
BEGIN
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,1) AS ng; -- unigrams (@N=1)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,2) AS ng; -- bigrams (@N=2)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,3) AS ng; -- trigrams (@N=3)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(@string,4) AS ng; -- 4-grams (@N=4)
END
--===== 2. Scenarios where the function would not return rows
SELECT ng.Position, ng.Token FROM samd.NGramsN4K('abcd',5) AS ng; -- 5-grams (@N=5)
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', 0) AS ng;
SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x', NULL) AS ng;
This will fail:
--SELECT ng.Position, ng.Token FROM samd.NGramsN4K(N'x',-1) AS ng;
--===== 3. How many times the substring "ƒƓ" appears in each record
BEGIN
DECLARE @table TABLE(stringID int identity primary key, string NVARCHAR(100));
INSERT @table(string)
VALUES (N'ƒƓ123ƒƓ'),(N'123ƒƓƒƓƒƓ'),(N'!ƒƓ!ƒƓ!'),(N'ƒƓ-ƒƓ-ƒƓ-ƒƓ-ƒƓ');
SELECT t.String, Occurances = COUNT(*)
FROM @table AS t
CROSS APPLY samd.NGramsN4K(t.string,2) AS ng
WHERE ng.token = N'ƒƓ'
GROUP BY t.string;
END;
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20180829 - Changed TOP logic and startup-predicate logic in the WHERE clause
- Alan Burstein
Rev 02 - 20191129 - Redesigned to leverage rangeAB - Alan Burstein
Rev 03 - 20200416 - changed the cast from NCHAR(4000) to NVARCHAR(4000)
- Removed: WHERE @N BETWEEN 1 AND s.Ln; this must now be handled
manually moving forward. - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
Position = r.RN, -- Token Position
Token = CAST(SUBSTRING(@string,r.RN,@N) AS NVARCHAR(4000)) -- @N-Sized Token
FROM (VALUES(DATALENGTH(ISNULL(NULLIF(@string,N''),N'X'))/2)) AS s(Ln)
CROSS APPLY core.rangeAB(1,s.Ln-(ISNULL(@N,1)-1),1,1) AS r
GO
samd.patReplaceN4K
CREATE FUNCTION samd.patReplaceN4K
(
@string NVARCHAR(4000), -- Input String
@pattern NVARCHAR(50), -- Pattern to match/replace
@replace NVARCHAR(20) -- What to replace the matched pattern with
)
/*****************************************************************************************
[Purpose]:
Given a string (@string), a pattern (@pattern), and a replacement character (@replace)
patReplaceN4K will replace any character in @string that matches the @Pattern parameter
with the character, @replace.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Basic Syntax Example
SELECT pr.NewString
FROM samd.patReplaceN4K(@String,@Pattern,@Replace) AS pr;
[Parameters]:
@string = NVARCHAR(4000); The input string to manipulate
@pattern = NVARCHAR(50); The pattern to match/replace
@replace = NVARCHAR(20); What to replace the matched pattern with
[Returns]:
Inline Table Valued Function returns:
NewString = NVARCHAR(4000); The new string with all instances of @Pattern replaced with
The value of @Replace.
[Dependencies]:
samd.NGramsN4K (ITVF)
[Developer Notes]:
1. @Pattern IS case sensitive but can be easily modified to make it case insensitive
2. There is no need to include the "%" before and/or after your pattern since we
are evaluating each character individually
3. Certain special characters, such as "$" and "%" need to be escaped with a "/"
like so: [/$/%]
4. As is the case with functions which leverage samd.ngrams or samd.ngramsN4k,
samd.patReplaceN4K is almost always dramatically faster with a parallel execution
plan. One way to get a parallel query plan (if the optimizer does not choose one) is
to use make_parallel by Adam Machanic found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
On my PC (8 logical CPU, 64GB RAM, SQL 2019) samd.patReplaceN4K is about 4X
faster when executed using all 8 of my logical CPUs.
5. samd.patReplaceN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Remove non alphanumeric characters
SELECT pr.NewString
FROM samd.patReplaceN4K('ൈൈƐABCƐƐ123ˬˬˬˬXYZˤˤ','[^0-9a-zA-Z]','') AS pr;
--===== 2. Replace numeric characters with a "*"
SELECT pr.NewString
FROM samd.patReplaceN4K('My phone number is 555-2211','[0-9]','*') AS pr;
--===== 3. Using against a table
DECLARE @table TABLE(OldString varchar(60));
INSERT @table VALUES ('Call me at 555-222-6666'), ('phone number: (312)555-2323'),
('He can be reached at 444.665.4466 on Monday.');
SELECT t.OldString, pr.NewString
FROM @table AS t
CROSS APPLY samd.patReplaceN4K(t.oldstring,'[0-9]','*') AS pr;
[Revision History]:
-----------------------------------------------------------------------------------------
Rev 01 - 20200422 - Created - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT CASE WHEN @string = a.Blank THEN a.Blank ELSE
CASE WHEN PATINDEX(@pattern,a.Token)&0x01=0 THEN ng.token ELSE @replace END END
FROM samd.NGramsN4K(@string,1) AS ng
CROSS APPLY (VALUES(CAST('' AS NVARCHAR(4000)),
ng.token COLLATE Latin1_General_BIN)) AS a(Blank,Token)
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('text()[1]', 'NVARCHAR(4000)');
GO
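The core of patReplaceN4K is a per-character PATINDEX test under a binary collation. A minimal sketch of that test on its own, with illustrative sample characters that are not from the answer:

```sql
-- Sketch: the single-character test that decides "keep" versus "replace".
-- Each character is tested by itself, so the pattern needs no % wildcards.
SELECT v.label,
       is_match = PATINDEX(N'[^0-9a-zA-Z]', v.ch COLLATE Latin1_General_BIN)
FROM (VALUES (N'letter J',       N'J'),
             (N'digit 7',        N'7'),
             (N'line feed',      NCHAR(10)),
             (N'no-break space', NCHAR(160))) AS v(label, ch);
-- is_match is 0 for 'J' and '7' (kept) and 1 for the line feed
-- and the no-break space (swapped for @replace)
```

The binary collation makes the ranges a-z and A-Z compare by code point, which is why the pattern is case sensitive, as the developer notes say.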
Upvotes: 0
Reputation: 835
Sure, you could run an update on that surname field as part of your import. Something like this would work for you:
UPDATE customers
SET surname = replace(surname,char(10),'')
Or you could use some dynamic SQL like this to generate the update statements. You could quickly change this so that it also executes them:
SELECT 'UPDATE '+TABLE_CATALOG+'.'+TABLE_SCHEMA+'.'+TABLE_NAME+' SET
'+COLUMN_NAME+' = REPLACE('+COLUMN_NAME+', CHAR(10),'''')'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE DATA_TYPE = 'nvarchar'
Upvotes: 1
Reputation: 5594
This should give you a list of columns to build a cursor around:
select COLUMN_NAME
from INFORMATION_SCHEMA.COLUMNS
where DATA_TYPE in ('varchar','nvarchar')
and TABLE_NAME = 'your table name'
This one works much smoother.
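A cursor built around that column list could look like the sketch below. The schema and table names are hypothetical placeholders, and the WHERE clause is an addition so that only rows that actually contain CHAR(10) get rewritten.

```sql
-- Sketch: loop over the (n)varchar columns of one table and strip CHAR(10).
DECLARE @schema sysname = N'dbo';          -- hypothetical schema name
DECLARE @table  sysname = N'customers';    -- hypothetical table name
DECLARE @column sysname, @sql nvarchar(max);

DECLARE col_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE DATA_TYPE IN ('varchar', 'nvarchar')
      AND TABLE_SCHEMA = @schema
      AND TABLE_NAME = @table;

OPEN col_cursor;
FETCH NEXT FROM col_cursor INTO @column;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build: UPDATE [schema].[table] SET [col] = REPLACE([col], CHAR(10), '')
    --        WHERE [col] LIKE '%' + CHAR(10) + '%'
    SET @sql = N'UPDATE ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)
             + N' SET ' + QUOTENAME(@column)
             + N' = REPLACE(' + QUOTENAME(@column) + N', CHAR(10), '''')'
             + N' WHERE ' + QUOTENAME(@column) + N' LIKE ''%'' + CHAR(10) + ''%''';

    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM col_cursor INTO @column;
END

CLOSE col_cursor;
DEALLOCATE col_cursor;
```

QUOTENAME guards against column names that contain spaces or reserved words.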
Upvotes: 0