spraff

Reputation: 33425

Why this elaborate RTF encoding of an apostrophe?

Scrivener produces RTF files with this elaborate apostrophe encoding:

They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it.

Unicode 8217 (U+2019) is "Right Single Quotation Mark". Okay, but this RTF has that Unicode character and \'92 as well. What's going on here?

Upvotes: 1

Views: 631

Answers (1)

Remy Lebeau

Reputation: 597166

That RTF breaks down to the following:

They didn    - plain text
\loch        - The text consists of single-byte low-ANSI (0x00–0x7F) characters
\af0         - Associated Font Number 0
\hich        - The text consists of single-byte high-ANSI (0x80–0xFF) characters
\af0         - Associated Font Number 0
\dbch        - The text consists of double-byte characters
\af0         - Associated Font Number 0
\uc1         - number of bytes corresponding to a given \uN Unicode character
\u8217       - a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page
\'92         - A hexadecimal value, based on the specified character set (may be used to identify 8-bit values). 
t do it.     - plain text
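
As a quick check on the last two escapes in that breakdown, here is a small Python sketch showing that they name the same character. It assumes the document's ANSI code page is Windows-1252 (cp1252); the snippet in the question does not show the RTF header, so that code page is an assumption.

```python
# Assumption: the RTF document's ANSI code page is Windows-1252 (cp1252).
# The fragment in the question does not include the header that would say so.

# \'92 is a single byte, 0x92, interpreted in the ANSI code page.
ansi_fallback = bytes([0x92]).decode("cp1252")

# \u8217 is a Unicode code point written in decimal.
unicode_char = chr(8217)

print(ansi_fallback == unicode_char)  # same character either way
print(f"U+{ord(unicode_char):04X}")   # the code point in the usual hex notation
```

Under a different code page, byte 0x92 could map to something else (or to nothing), which is why the \uN form is the authoritative one for Unicode-aware readers.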

Some of that is superfluous in this context and can be ignored; it is just font information. What matters is that \u8217 represents the apostrophe in Unicode (U+2019), \'92 represents the equivalent apostrophe in ANSI (byte 0x92, which is that same character in Windows-1252), and \uc1 indicates that the ANSI fallback following \u8217 is 1 byte long. A Unicode-enabled RTF reader will handle \u8217 and skip \'92. A non-Unicode RTF reader will ignore \u8217 and handle \'92. This is stated in the RTF spec for Unicode RTF:

\uN

This keyword represents a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.

This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.

...

An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.
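
The reader-side rule quoted above can be sketched in a few lines of Python. This is only an illustration, not a real RTF parser: `rtf_fragment_to_text` is a hypothetical helper, it assumes a cp1252 ANSI code page, and it understands nothing beyond \ucN, \uN, \'hh, plain text, and "drop every other control word" (no groups, no header, no destinations).

```python
import re

# Hypothetical helper for illustration only; not part of any RTF library.
# Token kinds: \'hh hex escape | \word with optional number | plain text.
TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|\\([a-z]+)(-?\d+)? ?|([^\\]+)", re.S)

def rtf_fragment_to_text(fragment, unicode_aware=True, codepage="cp1252"):
    out = []
    uc = 1    # fallback length after each \uN; the spec's default is 1
    skip = 0  # fallback characters still to be ignored
    for match in TOKEN.finditer(fragment):
        hex_byte, word, num, text = match.groups()
        if hex_byte is not None:
            if skip > 0:
                skip -= 1  # this \'hh is the ANSI fallback for a \uN: ignore it
            else:
                out.append(bytes([int(hex_byte, 16)]).decode(codepage))
        elif word is not None:
            if word == "uc":
                uc = int(num)
            elif word == "u" and unicode_aware:
                n = int(num)
                # Values above 32767 are written as negative 16-bit numbers.
                out.append(chr(n + 65536 if n < 0 else n))
                skip = uc
            # Every other control word (\loch, \af0, ...) is simply dropped,
            # which is also what an old reader does with an unknown \uN.
        else:
            if skip > 0:  # plain-text fallback characters are skipped too
                text, skip = text[skip:], max(0, skip - len(text))
            out.append(text)
    return "".join(out)

sample = r"They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it."
print(rtf_fragment_to_text(sample, unicode_aware=True))
print(rtf_fragment_to_text(sample, unicode_aware=False))
```

With the Scrivener fragment both modes end up with the same apostrophe, which is the point of writing both encodings. The difference only shows when the writer's "best ANSI representation" is lossy, for example a `?` fallback after \u8217.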

Upvotes: 4
