Does Oracle store Unicode text in a particular normalized form?

Question

I would like to know whether an Oracle Unicode database stores data in a normalized form, or whether Oracle guarantees that text returned from a query is in a particular normalized form.

This seems like it should be an easy question to answer, but I find no information on it on the web -- which leads me to think the answer is no. Does anyone have the skinny on this?

sstan · Accepted Answer

Notice what the Oracle documentation on Canonical Equivalence says:

Canonical equivalence is an attribute of a multilingual collation and describes how equivalent code point sequences are sorted. If canonical equivalence is applied in a particular multilingual collation, then canonically equivalent strings are treated as equal.

One Unicode code point can be equivalent to a sequence of base letter code points plus diacritic code points. This is called the Unicode canonical equivalence. For example, ä equals its base letter a and an umlaut. A linguistic flag, CANONICAL_EQUIVALENCE = TRUE, indicates that all canonical equivalence rules defined in Unicode need to be applied in a specific multilingual collation. Oracle Database-defined multilingual collations include the appropriate setting for the canonical equivalence flag. You can set the flag to FALSE to speed up the comparison and ordering functions if all the data is in its composed form.
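To make the ä example from the quote concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It is not Oracle code; the helper names `code_points` and `canonically_equivalent` are made up for illustration. It shows that the composed and decomposed spellings are different code point sequences that only compare equal once both are brought to the same normalization form.

```python
import unicodedata

composed = "\u00e4"      # ä as a single precomposed code point
decomposed = "a\u0308"   # base letter a followed by COMBINING DIAERESIS


def code_points(s):
    """List the code points of s in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]


def canonically_equivalent(a, b):
    """Compare two strings after bringing both to NFD (illustrative helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(code_points(composed))     # ['U+00E4']
print(code_points(decomposed))   # ['U+0061', 'U+0308']
print(composed == decomposed)    # False: a plain comparison sees different sequences
print(canonically_equivalent(composed, decomposed))  # True
```

A collation that applies canonical equivalence behaves like the last comparison; one that does not behaves like the plain `==`.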

So, basically, Oracle has a CANONICAL_EQUIVALENCE flag that controls whether the composed and decomposed forms of the same logical Unicode character are treated as equal when strings are compared and sorted.

The very existence of this flag implies that Oracle does not normalize (compose or decompose) Unicode characters automatically when it stores the data. If it did, the flag would serve no purpose. The documentation's remark that you can set the flag to FALSE "if all the data is in its composed form" points the same way: it presupposes that stored data may not be in composed form.
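If you need a guaranteed form, one option is to normalize on the client side before writing and to check what comes back. The following is only a sketch under that assumption, again using Python's standard `unicodedata` module. The names `to_nfc` and `find_non_nfc` are hypothetical, and `rows` stands in for values fetched from a table; no Oracle driver calls are shown.

```python
import unicodedata


def to_nfc(text):
    """Return text in NFC (canonical composition); apply before binding to an INSERT."""
    return unicodedata.normalize("NFC", text)


def find_non_nfc(values):
    """Return the values that are not already in NFC (illustrative audit helper)."""
    return [v for v in values if unicodedata.normalize("NFC", v) != v]


# Stand-in for fetched column values: one decomposed, one composed, one ASCII.
rows = ["Ma\u0308dchen", "M\u00e4dchen", "plain ascii"]

offenders = find_non_nfc(rows)          # only the decomposed spelling is flagged
cleaned = [to_nfc(v) for v in rows]     # every value is now in NFC

print(len(offenders))                   # 1
print(find_non_nfc(cleaned))            # []
```

Which form to pick (NFC here) is a design choice; the point is that the application, not the database, has to enforce it.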
