DaCh
DaCh

Reputation: 931

mssql select all nvarchar with wrong encoding

I am working with a old database where someone didn't encode the data the right way before inserting it into the database. which result in text like

"Wrong t�xt" (in my case the '�' is a ø).

I am looking for a way to find all rows where the column contains data like this, so i can correct it.

So far i tried using regex like

SELECT * FROM table WHERE ([colm] not like '[a-zA-Z\s]%') 

but no matter what i do, i can't find a way to select only the ones containing the '�'

a search like

    SELECT * FROM table WHERE ([colm] like '%�%') 

won't return anything either. (tried it, just in cases).

I been search for this on Google and here on Stackoverflow, but either there is no one having this problem, or I am searching for the wrong thing.

So if someone would be so kind to help me with this, I would be really happy.

Thanks for your time.

Upvotes: 0

Views: 2020

Answers (4)

Jeroen Mostert
Jeroen Mostert

Reputation: 28809

Assuming the character in the string really is U+FFFD REPLACEMENT CHARACTER (�), and it's not displayed as a replacement character because there are actually other bytes in there that can't be decoded properly, you can find it with

SELECT * FROM table WHERE [colm] LIKE N'%�%' COLLATE Latin1_General_BIN2

Or (to avoid any further issues with encoding mangling characters)

SELECT * FROM table WHERE [colm] LIKE N'%' + NCHAR(0xfffd) + N'%' COLLATE Latin1_General_BIN2

Unicode is required because � does not exist in any single-byte collation, and a binary collation is required because the regular collations treat � as if it did not occur in strings at all.

Upvotes: 2

Raul
Raul

Reputation: 3141

You could replace occurences of the U+FFFD REPLACEMENT CHARACTER (�) and compare it with the original value:

SELECT *
, CASE WHEN CONVERT(VARBINARY(MAX), t.colm) = CAST(REPLACE(CONVERT(VARBINARY(MAX), t.colm), 0xFDFF, 0x) AS VARBINARY(MAX)) THEN 1 ELSE 0 END AS EncodingCorrect
FROM (
    SELECT N'Wrong t�xt" (in my case the ''�'' is a ø)' AS colm
    UNION ALL
    SELECT 'Correct text'
    UNION ALL
    SELECT 'Wrong t?xt" (in my case the ''?'' is a ø)'
) t

@Jeroen Mostert's suggestion WHERE colm LIKE N'%�%' COLLATE Latin1_General_BIN2 seems like the better and more readable solution.

Upvotes: 0

Thom A
Thom A

Reputation: 96038

As Jeroen mentioned, using a binary seems to be the way to go. Personally I would suggest using NGrams4k here, but I built a quick tally table instead that does the job:

WITH N AS(
    SELECT N
    FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) N(N)),
Tally AS(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
    FROM N N1, N N2, N N3, N N4)
SELECT V.Colm
FROM (VALUES(N'Wrong t�xt" (in my case the ''�'' is a ø)'),
            (N'This string is ok'))V(colm)
     JOIN Tally T ON LEN(V.Colm) >= T.I
     CROSS APPLY (VALUES(SUBSTRING(V.Colm,T.I,1))) SS(C) 
GROUP BY V.colm
HAVING COUNT(CASE CONVERT(binary(2),SS.C) WHEN 0xFDFF THEN 1 END) > 0;

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1271161

Try this:

WHERE [colm] not like N'%[a-zA-Z]%'

Of course, this should return values with numbers, spaces, and punctuation.

Upvotes: 1

Related Questions