baratali
baratali

Reputation: 443

How to select rows with 4-byte UTF-8 chars in Oracle DB?

I need to detect whether my table has 4-byte UTF-8 characters and delete them if any. I tried following SQLs, but they return error ORA-12726: unmatched bracket in regular expression. All brackets are matched, so error is incorrect.

select REGEXP_REPLACE(asd𠜎aasd', 
  '[' 
  || UTL_I18N.RAW_TO_CHAR ('010000', 'UTF8') 
  || '-' 
  || UTL_I18N.RAW_TO_CHAR ('01FFFF', 'UTF8') 
  || ']', '') 
from dual;

select REGEXP_REPLACE('asd𠜎aasd', 
  '[' 
  || chr(to_number('010000', 'xxxxxx')) 
  || '-' 
  || chr(to_number('01FFFF', 'xxxxxx')) 
  || ']', '') 
from dual;

What is wrong with these queries?

Do you know how to find rows by character code points range?

Upvotes: 2

Views: 3727

Answers (4)

collapsar
collapsar

Reputation: 17238

When you convert from raw to char you must supply the octet sequence in the chosen encoding. What you supply are the Unicode code points but not their utf8 encoding. To get that, you can use online tools like this one or a code table. E.g. U+010000 is represented as f0908080in utf-8 (range extended to U+010000-U+02FFFF to include source character).

So in the sample case, use the following:

select REGEXP_REPLACE('asd'||UTL_I18N.RAW_TO_CHAR ('F0A09C8E', 'AL32UTF8')||'aasd', 
  '[' 
  || UTL_I18N.RAW_TO_CHAR ('F0908080', 'AL32UTF8') 
  || '-' 
  || UTL_I18N.RAW_TO_CHAR ('F0AFBFBF', 'AL32UTF8') 
  || ']', '') 
from dual;

Upvotes: 2

Wernfried Domscheit
Wernfried Domscheit

Reputation: 59553

If you prefer RAW_TO_CHAR, you can do it like this:

SELECT REGEXP_REPLACE('asd𠜎aasd', 
  '[' 
  || UTL_I18N.RAW_TO_CHAR ('F0908080', 'AL32UTF8')
  || '-' 
  || UTL_I18N.RAW_TO_CHAR ('F48FBFBF', 'AL32UTF8') 
  || ']', '') 
FROM dual;

Oracle Character set UTF8 is not the same as commonly known encoding UTF-8. Oracle Character set UTF8 means CESU-8 encoding. For UTF-8 you have to use Oracle character set AL32UTF8

U+01FFFF is not the biggest Unicode codepoint, your string may contain "冕" (U+02F8D3 CJK COMPATIBILITY IDEOGRAPH-2F8D3). The biggest possible Unicode codepoint is U+10FFFF, which is equal to UNISTR('\DBFF\DFFF')(as provided by Alex Poole's answer), resp. UTL_I18N.RAW_TO_CHAR ('F48FBFBF', 'AL32UTF8').

btw, for the conversion I use this page: Unicode code converter

Upvotes: 1

Alex Poole
Alex Poole

Reputation: 191455

You can use the UNISTR function; the 𠜎 character is codepoint U+2070E, which in UTF-16 is D841DF0E. As the documentation notes:

Supplementary characters are encoded as two code units, the first from the high-surrogates range (U+D800 to U+DBFF), and the second from the low-surrogates range (U+DC00 to U+DFFF).

Which means you can represent it with:

select unistr('\D841\DF0E') from dual;

UNISTR('\D841\DF0E')
--------------------
𠜎

You can then use UNISTR to construct your range:

select REGEXP_REPLACE('asd𠜎aasd', 
  '[' 
  || UNISTR('\D800\DC00') 
  || '-' 
  || UNISTR('\DBFF\DFFF') 
  || ']', '') 
from dual;

REGEXP_REPLACE('ASD𠜎AASD','['||UNISTR('\D800\DC00')||'-'||UNISTR('\DBFF\DFFF')||']','')
----------------------------------------------------------------------------------------
asdaasd

Assuming you want to exclude all supplementary characters; you can adjust the range if you have a more narrow focus.

Upvotes: 3

Moudiz
Moudiz

Reputation: 7377

maybe this not what exacly you need but to solve the error ora-12726 use this query

select REGEXP_REPLACE('asdaasd', 
  '\[' 
  || chr(to_number('010000', 'xxxxxx')) 
  || '-' 
  || chr(to_number('01FFFF', 'xxxxxx')) 
  || '\]', '') 
from dual;

Upvotes: 2

Related Questions