Mark McDonald
Mark McDonald

Reputation: 131

Removing accents from string in Snowflake

I am trying to remove all accents from a string column in Snowflake. In MSSQL I can use the collate function to do this. For example I have this string in a column: 'JESÚSSMITH12345'. I want to remove the accent from above the U:

In MSSQL I can do the following:

select ('JESÚSSMITH12345' Collate SQL_Latin1_General_CP1253_CI_AI) as identifier

returns: JESUSSMITH12345

However, I can't find a way to do this in Snowflake, other than using the TRANSLATE function. Using the TRANSLATE function isn't a good solution because, I would have to define every possible letter with an accent to do the translation. For example, this works in Snowflake:

select translate('JESÚSSMITH12345', 'Ú', 'U');

returns: JESUSSMITH12345

However, if a column contains any value other than U with an accent, I have to add that to my translation. It's not ideal.

Using the collate function in Snowflake, even with "ai" (a.k.a accent-insensitive), this still returns the string with the accented U.

For example in Snowflake:

select collate('JESÚSSMITH12345', 'en-ai');  

returns: JESÚSSMITH12345

Upvotes: 2

Views: 4754

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59225

Update: Better JS line using {Diacritic} by Greg Pavlik

return str.normalize("NFD").replace(/\p{Diacritic}/gu, "");

You can solve this with a JS UDF in Snowflake:

CREATE OR REPLACE FUNCTION normalize_js(S string)
  RETURNS string
  LANGUAGE JAVASCRIPT
  AS 'return S.normalize("NFD").replace(/[\u0300-\u036f]/g, "");'
;

select normalize_js('áéÉña');

-- 'aeEna'

I got that JS code from Remove accents/diacritics in a string in JavaScript.

Upvotes: 3

NickW
NickW

Reputation: 9788

the problem that you are facing is that U and Ú are completely different characters so from a coding/SQL perspective the statement "remove the accent from above the U" is meaningless.

The only option you have is to use, as you have done, the TRANSLATE function (or build equivalent functionality in a Stored Procedure). You will need to take every character that has an ascii value >= 128 and provide the character you want to replace it with. You could hard code this or, for more flexibility, put the accented chars, and the char you want to translate them to, into a table and use this in your SP

Upvotes: 1

Related Questions