Reputation: 33425
Scrivener produces RTF files with this elaborate apostrophe encoding:
They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it.
Unicode 8217 is "Right Single Quotation Mark". Okay, but this RTF has that Unicode character and \'92 as well. What's going on here?
Upvotes: 1
Views: 631
Reputation: 597166
That RTF breaks down to the following:
They didn - plain text
\loch - The text consists of single-byte low-ANSI (0x00–0x7F) characters
\af0 - Associated Font Number 0
\hich - The text consists of single-byte high-ANSI (0x80–0xFF) characters
\af0 - Associated Font Number 0
\dbch - The text consists of double-byte characters
\af0 - Associated Font Number 0
\uc1 - The number of ANSI fallback bytes that follow each \uN Unicode character (here, 1)
\u8217 - a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page
\'92 - A hexadecimal value, based on the specified character set (may be used to identify 8-bit values).
t do it. - plain text
Some of that is superfluous in this context and can be ignored; it is just font information. What is important is that \u8217 represents the apostrophe in Unicode, \'92 represents an equivalent apostrophe in ANSI, and \uc1 indicates that the \'92 fallback occupies 1 byte. A Unicode-enabled RTF reader will handle \u8217 and ignore \'92. A non-Unicode RTF reader will ignore \u8217 and handle \'92. This is stated in the RTF spec for Unicode RTF:
\uN
This keyword represents a single Unicode character that has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.
...
An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change.
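To make the reader behaviour described above concrete, here is a minimal Python sketch of how the two kinds of reader could process that line. It is an illustration only, not a real RTF parser: it does not handle groups, escaped braces or backslashes, or font tables. The names `decode_rtf_text`, `unicode_aware`, and `codepage` are made up for this example, and it assumes the ANSI code page is Windows-1252, where byte 0x92 is the right single quotation mark.

```python
import re

# One token is a \'hh hex escape, a control word with an optional numeric
# argument and an optional single-space delimiter, or any other character.
TOKEN = re.compile(r"\\'([0-9a-fA-F]{2})|\\([a-z]+)(-?\d+)? ?|(.)", re.DOTALL)


def decode_rtf_text(rtf, unicode_aware=True, codepage="cp1252"):
    """Decode a flat run of RTF text (hypothetical helper, assumes cp1252)."""
    out = []
    uc = 1    # fallback count set by \ucN (1 until told otherwise)
    skip = 0  # fallback characters still to discard after a \uN
    for match in TOKEN.finditer(rtf):
        hex_byte, word, arg, char = match.groups()
        if word is not None:
            if word == "uc" and arg is not None:
                uc = int(arg)
            elif word == "u" and arg is not None and unicode_aware:
                code = int(arg)
                if code < 0:
                    code += 65536  # \uN is a signed 16-bit value
                out.append(chr(code))
                skip = uc  # discard the ANSI fallback that follows
            # Every other control word (\loch, \af0, ...) is ignored here,
            # as is \uN itself when the reader is not Unicode-aware.
            continue
        if skip:
            skip -= 1
            continue
        if hex_byte is not None:
            out.append(bytes.fromhex(hex_byte).decode(codepage))
        else:
            out.append(char)
    return "".join(out)


sample = r"They didn\loch\af0\hich\af0\dbch\af0\uc1\u8217\'92t do it."
modern = decode_rtf_text(sample)                       # uses \u8217, skips \'92
legacy = decode_rtf_text(sample, unicode_aware=False)  # skips \u8217, uses \'92
assert modern == legacy
```

Both calls end up with the same text, which is the point of writing the character twice: the Unicode-aware path takes U+2019 from \u8217, and the legacy path gets the same character by decoding byte 0x92 in the assumed code page.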
Upvotes: 4