sql: detect string containing multiple languages

Question

Column [Plate_No] in [my_Table] contains vehicle plate numbers for Greek and foreign vehicles. Generally Greek plate numbers contain only Greek characters, and foreign plates contain only Latin characters.

Now, some rows contain plate numbers with mixed characters and is incorrect.

for example:

DECLARE @my_Plates AS TABLE (
   Plate_No NVARCHAR(50)
)

INSERT INTO @my_Plates (
    Plate_No
)
SELECT 'AAA1234'
UNION ALL
SELECT 'ΑΑΑ1234'
UNION ALL
SELECT 'AΑA1234'

SELECT Plate_No, LOWER(Plate_No) AS LC_Plate_No
FROM @my_Plates

result:

Plate_No | LC_Plate_No
---------------------
AAA1234  | aaa1234      <-- correct
ΑΑΑ1234  | ααα1234      <-- correct
AΑA1234  | aαa1234      <-- incorrect

As you can see plate numbers are visually the same, but when LOWER applied, the difference is obvious.

How can those incorrect rows be detected ?

Any help would be appreciated.

Ed Harper · Accepted Answer

Here's one method which works when the column collation is Greek_CI_AS:

DECLARE @my_Plates AS TABLE (
   Plate_No NVARCHAR(50) COLLATE Greek_CI_AS
)

INSERT INTO @my_Plates (
    Plate_No
)
SELECT UPPER(N'aaa1234')
UNION ALL
SELECT UPPER(N'ααα1234')
UNION ALL
SELECT UPPER(N'aαa1234')

SELECT  Plate_No, LOWER(Plate_No) 
FROM    @my_Plates
WHERE   Plate_No <> CAST(Plate_No AS varchar(50)) COLLATE Latin1_General_CI_AS
AND     Plate_No LIKE '%[A-Z]%'

this probably won't perform well on large data sets - neither of the WHERE conditions will be able to use an index on Plate_No.

sql: detect string containing multiple languages

Answers (2)

Related Questions