Tom Harrison

Reputation: 14038

Ruby 2: Recognizing decomposed utf8 in XML entities (NFD)

Problem

The problem is simple: I have XML containing this value:

Mu¨ller

This appears to be a valid XML representation of a u with an umlaut, like this:

Müller

But every parser we have tried so far yields two distinct characters: a u followed by a separate ¨.

Background

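This form of Unicode uses two codepoints to represent a single character: a base letter followed by a combining mark. It is called Normalization Form D (canonical decomposition), or NFD. In UTF-8, a decomposed ü is the byte for u followed by \314\210 (octal), the encoding of U+0308.

Most characters can also be represented as a single codepoint and entity, including this case. The XML could also have included ü directly, or a character reference for it such as &#252; or &#xFC;, which in UTF-8 is the bytes \303\274 in octal (195 188 in decimal). This is called Normalization Form C (composed), or NFC. Any of these would work fine.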
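
The two forms can be compared directly in Ruby. This is a minimal sketch, assuming Ruby 2.2 or later (where String#unicode_normalize is available); the variable names are only illustrative:

```ruby
# NFC: ü as the single codepoint U+00FC
composed = "M\u00FCller"
# NFD: plain "u" followed by U+0308 COMBINING DIAERESIS
decomposed = "Mu\u0308ller"

p composed.length            # 6
p decomposed.length          # 7
p composed == decomposed     # false: the byte sequences differ

p composed.bytes[1, 2]       # [195, 188]      (octal \303\274)
p decomposed.bytes[1, 3]     # [117, 204, 136] (u, then octal \314\210)

# Normalizing either string to the other form makes them compare equal.
p decomposed.unicode_normalize(:nfc) == composed    # true
p composed.unicode_normalize(:nfd) == decomposed    # true
```

Both strings display as Müller, so comparing lengths or bytes is the reliable way to tell which form a parser actually returned.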

Getting Right to the Question

So I think the question is one of:

Thanks!

Upvotes: 3

Views: 766

Answers (1)

matt

Reputation: 79783

The character you’re using, U+00A8 (DIAERESIS), isn’t a combining character – it is distinct from U+0308 (COMBINING DIAERESIS). (I’ve only just discovered this myself – I don’t know what the non-combining diaeresis is used for.)

It looks like in this case the parsers’ behaviour is correct and your XML is wrong: it should be using U+0308 (for example &#x308;) and not U+00A8 (&#xA8;).
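
The difference between the two characters shows up as soon as you normalize. The sketch below assumes Ruby 2.2 or later and works on already-decoded strings rather than on a particular XML parser. The final tr step is a hypothetical workaround, not something from the question: it would also rewrite any legitimate standalone ¨ in the data, so it only makes sense if the XML producer cannot be fixed.

```ruby
# What the XML in the question decodes to: u + U+00A8 DIAERESIS (spacing)
spacing = "Mu\u00A8ller"
# What it should contain: u + U+0308 COMBINING DIAERESIS
combining = "Mu\u0308ller"

# NFC leaves the spacing diaeresis alone, so the string keeps 7 codepoints.
p spacing.unicode_normalize(:nfc).length                 # 7

# The combining form composes into the single codepoint U+00FC.
p combining.unicode_normalize(:nfc).length               # 6
p combining.unicode_normalize(:nfc) == "M\u00FCller"     # true

# Hypothetical repair: swap U+00A8 for U+0308, then compose.
repaired = spacing.tr("\u00A8", "\u0308").unicode_normalize(:nfc)
p repaired == "M\u00FCller"                              # true
```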

Upvotes: 3
