Reputation: 443
I need to detect whether my table has 4-byte UTF-8 characters and delete them if any.
I tried following SQLs, but they return error ORA-12726: unmatched bracket in regular expression
. All brackets are matched, so error is incorrect.
select REGEXP_REPLACE(asd𠜎aasd',
'['
|| UTL_I18N.RAW_TO_CHAR ('010000', 'UTF8')
|| '-'
|| UTL_I18N.RAW_TO_CHAR ('01FFFF', 'UTF8')
|| ']', '')
from dual;
select REGEXP_REPLACE('asd𠜎aasd',
'['
|| chr(to_number('010000', 'xxxxxx'))
|| '-'
|| chr(to_number('01FFFF', 'xxxxxx'))
|| ']', '')
from dual;
What is wrong with these queries?
Do you know how to find rows by character code points range?
Upvotes: 2
Views: 3727
Reputation: 17238
When you convert from raw to char you must supply the octet sequence in the chosen encoding. What you supply are the Unicode code points but not their utf8 encoding. To get that, you can use online tools like this one or a code table. E.g. U+010000
is represented as f0908080
in utf-8 (range extended to U+010000
-U+02FFFF
to include source character).
So in the sample case, use the following:
select REGEXP_REPLACE('asd'||UTL_I18N.RAW_TO_CHAR ('F0A09C8E', 'AL32UTF8')||'aasd',
'['
|| UTL_I18N.RAW_TO_CHAR ('F0908080', 'AL32UTF8')
|| '-'
|| UTL_I18N.RAW_TO_CHAR ('F0AFBFBF', 'AL32UTF8')
|| ']', '')
from dual;
Upvotes: 2
Reputation: 59553
If you prefer RAW_TO_CHAR
, you can do it like this:
SELECT REGEXP_REPLACE('asd𠜎aasd',
'['
|| UTL_I18N.RAW_TO_CHAR ('F0908080', 'AL32UTF8')
|| '-'
|| UTL_I18N.RAW_TO_CHAR ('F48FBFBF', 'AL32UTF8')
|| ']', '')
FROM dual;
Oracle Character set UTF8
is not the same as commonly known encoding UTF-8. Oracle Character set UTF8
means CESU-8 encoding. For UTF-8 you have to use Oracle character set AL32UTF8
U+01FFFF
is not the biggest Unicode codepoint, your string may contain "冕" (U+02F8D3 CJK COMPATIBILITY IDEOGRAPH-2F8D3
). The biggest possible Unicode codepoint is U+10FFFF
, which is equal to UNISTR('\DBFF\DFFF')
(as provided by Alex Poole's answer), resp. UTL_I18N.RAW_TO_CHAR ('F48FBFBF', 'AL32UTF8')
.
btw, for the conversion I use this page: Unicode code converter
Upvotes: 1
Reputation: 191455
You can use the UNISTR function; the 𠜎 character is codepoint U+2070E, which in UTF-16 is D841DF0E. As the documentation notes:
Supplementary characters are encoded as two code units, the first from the high-surrogates range (U+D800 to U+DBFF), and the second from the low-surrogates range (U+DC00 to U+DFFF).
Which means you can represent it with:
select unistr('\D841\DF0E') from dual;
UNISTR('\D841\DF0E')
--------------------
𠜎
You can then use UNISTR to construct your range:
select REGEXP_REPLACE('asd𠜎aasd',
'['
|| UNISTR('\D800\DC00')
|| '-'
|| UNISTR('\DBFF\DFFF')
|| ']', '')
from dual;
REGEXP_REPLACE('ASD𠜎AASD','['||UNISTR('\D800\DC00')||'-'||UNISTR('\DBFF\DFFF')||']','')
----------------------------------------------------------------------------------------
asdaasd
Assuming you want to exclude all supplementary characters; you can adjust the range if you have a more narrow focus.
Upvotes: 3
Reputation: 7377
maybe this not what exacly you need but to solve the error ora-12726 use this query
select REGEXP_REPLACE('asdaasd',
'\['
|| chr(to_number('010000', 'xxxxxx'))
|| '-'
|| chr(to_number('01FFFF', 'xxxxxx'))
|| '\]', '')
from dual;
Upvotes: 2