Reputation: 931
I am working with a old database where someone didn't encode the data the right way before inserting it into the database. which result in text like
"Wrong t�xt" (in my case the '�' is a ø).
I am looking for a way to find all rows where the column contains data like this, so i can correct it.
So far i tried using regex like
SELECT * FROM table WHERE ([colm] not like '[a-zA-Z\s]%')
but no matter what i do, i can't find a way to select only the ones containing the '�'
a search like
SELECT * FROM table WHERE ([colm] like '%�%')
won't return anything either. (tried it, just in cases).
I been search for this on Google and here on Stackoverflow, but either there is no one having this problem, or I am searching for the wrong thing.
So if someone would be so kind to help me with this, I would be really happy.
Thanks for your time.
Upvotes: 0
Views: 2020
Reputation: 28809
Assuming the character in the string really is U+FFFD REPLACEMENT CHARACTER
(�), and it's not displayed as a replacement character because there are actually other bytes in there that can't be decoded properly, you can find it with
SELECT * FROM table WHERE [colm] LIKE N'%�%' COLLATE Latin1_General_BIN2
Or (to avoid any further issues with encoding mangling characters)
SELECT * FROM table WHERE [colm] LIKE N'%' + NCHAR(0xfffd) + N'%' COLLATE Latin1_General_BIN2
Unicode is required because � does not exist in any single-byte collation, and a binary collation is required because the regular collations treat � as if it did not occur in strings at all.
Upvotes: 2
Reputation: 3141
You could replace occurences of the U+FFFD REPLACEMENT CHARACTER
(�) and compare it with the original value:
SELECT *
, CASE WHEN CONVERT(VARBINARY(MAX), t.colm) = CAST(REPLACE(CONVERT(VARBINARY(MAX), t.colm), 0xFDFF, 0x) AS VARBINARY(MAX)) THEN 1 ELSE 0 END AS EncodingCorrect
FROM (
SELECT N'Wrong t�xt" (in my case the ''�'' is a ø)' AS colm
UNION ALL
SELECT 'Correct text'
UNION ALL
SELECT 'Wrong t?xt" (in my case the ''?'' is a ø)'
) t
@Jeroen Mostert's suggestion WHERE colm LIKE N'%�%' COLLATE Latin1_General_BIN2
seems like the better and more readable solution.
Upvotes: 0
Reputation: 96038
As Jeroen mentioned, using a binary
seems to be the way to go. Personally I would suggest using NGrams4k
here, but I built a quick tally table instead that does the job:
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) N(N)),
Tally AS(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS I
FROM N N1, N N2, N N3, N N4)
SELECT V.Colm
FROM (VALUES(N'Wrong t�xt" (in my case the ''�'' is a ø)'),
(N'This string is ok'))V(colm)
JOIN Tally T ON LEN(V.Colm) >= T.I
CROSS APPLY (VALUES(SUBSTRING(V.Colm,T.I,1))) SS(C)
GROUP BY V.colm
HAVING COUNT(CASE CONVERT(binary(2),SS.C) WHEN 0xFDFF THEN 1 END) > 0;
Upvotes: 1
Reputation: 1271161
Try this:
WHERE [colm] not like N'%[a-zA-Z]%'
Of course, this should return values with numbers, spaces, and punctuation.
Upvotes: 1