(T)SQL Query to filter all datasets, which contain �

Question

I've just received a (MSSQL) database, which unfortunately has corrupted data in it. In some datasets German special characters (ä/ü/ö) were substitued for a � symbol (Unicode: \uFFFD).

This is not a problem with the collation or the use of various encodings or anything (I checked that already), but rather the fault of whoever provided the orignal data, as I am pretty sure that these symbols are 'hardcoded' into the data meaning that they were "always" there and probably came into existence by a failed migration (or something similar) by the data provider.

As I can't get my hands on a proper dump without these errors, I am currently looking for the possibility to select only the datasets affected. I already tried a [...] WHERE LIKE '%\uFFFD%' statement (hoping that the DB internally decodes these symbols to the unicode code), but to no avail. [...] WHERE LIKE '%�%' works partially, as it shows all affected columns, but also all datasets containing a normal ? character (which makes this method not usable for me, as the output is far too big).

Does anyone know if there is a possibility to specifically only select the datasets containing the �?

Best regards, daZza

GarethD · Accepted Answer

The results will depend on your collation settings. I was able to get it to work with a few different ones. I think the most generic is this:

SELECT  *
FROM    (VALUES (N'A�A'), (N'A?A'), (N'AAA'), (N'�A'), (N'A�'), (N'�')) t (c)
WHERE   CHARINDEX(N'�' COLLATE Latin1_General_BIN, C) > 0;

Or

SELECT  *
FROM    (VALUES (N'A�A'), (N'A?A'), (N'AAA'), (N'�A'), (N'A�'), (N'�')) t (c)
WHERE   C LIKE N'%�%' COLLATE Latin1_General_BIN;

When using LIKE it only seems to work with collations that use a binary sort (although I have not tried all collations!). You can get these using:

SELECT name, description
FROM fn_helpcollations()

(T)SQL Query to filter all datasets, which contain �

Answers (1)

Related Questions