Reputation: 14038
The problem is simple: I have XML containing this value:
Mu¨ller
This appears to be a valid XML way of representing a u with an umlaut, like this:
Müller
But all the parsers we have tried so far produce u followed by ¨, two distinct characters.
This form of Unicode (UTF-8) uses two codepoints to represent a single character. It is called Normalization Form D (NFD, canonical decomposition), and in binary it is \303\274.
Most characters, including this one, can also be represented as a single codepoint and entity. The XML could also have included a literal ü, or a numeric character reference such as &#252; or &#xFC;, which in binary is \195\188. This is called Normalization Form C (NFC, canonical composition). Any of these would work fine.
So I think the question is one of:
Thanks!
Upvotes: 3
Views: 766
Reputation: 79783
The character you’re using, U+00A8 (DIAERESIS), isn’t a combining character; it is distinct from U+0308 (COMBINING DIAERESIS). (I’ve only just discovered this myself, and I don’t know what the non-combining diaeresis is used for.)
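The difference between the two characters can be checked against the Unicode character database. Here is a minimal sketch using Python's standard `unicodedata` module. Python is my own choice, since the question doesn't name a language, and the variable names are only illustrative.

```python
import unicodedata

DIAERESIS = "\u00a8"            # spacing (non-combining) diaeresis
COMBINING_DIAERESIS = "\u0308"  # combining diaeresis

for ch in (DIAERESIS, COMBINING_DIAERESIS):
    # combining() returns the canonical combining class; 0 means "not a combining mark"
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# NFC composes u + U+0308 into the single codepoint U+00FC ...
composed = unicodedata.normalize("NFC", "u" + COMBINING_DIAERESIS)
# ... but leaves u + U+00A8 as two separate characters.
not_composed = unicodedata.normalize("NFC", "u" + DIAERESIS)

print(len(composed), len(not_composed))
```

Only the sequence with U+0308 collapses to one character under NFC. The sequence with U+00A8 stays as two.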
It looks like the parsers’ behaviour is correct in this case and your XML is wrong: it should be using the combining character U+0308 (for example as &#x308;) and not U+00A8 (¨).
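To illustrate, here is a small sketch of what a parser does with each reference, assuming Python's standard `xml.etree.ElementTree`. The `<name>` element and both sample documents are hypothetical, since the question doesn't show the real XML. The parser returns the characters exactly as referenced, so composing them into a single ü is a separate normalization step.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample documents; only the character reference differs.
with_spacing = ET.fromstring("<name>Mu&#xA8;ller</name>").text     # u + U+00A8
with_combining = ET.fromstring("<name>Mu&#x308;ller</name>").text  # u + U+0308

# The parser does not normalize: both strings still have 7 codepoints.
print(len(with_spacing), len(with_combining))

# NFC composes u + U+0308 into U+00FC, but cannot compose u + U+00A8.
fixed = unicodedata.normalize("NFC", with_combining)
unchanged = unicodedata.normalize("NFC", with_spacing)

print(fixed == "M\u00fcller", unchanged == with_spacing)
```

If the XML can't be changed at the source, replacing U+00A8 with U+0308 before normalizing would be one workaround. That is a guess about what the producer intended, though, so it should be applied deliberately.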
Upvotes: 3