Max Taggart

Reputation: 823

How to Properly Decode Hex Values in RTF

Unfortunately, this one goes down two rabbit holes: text encodings and RTF. But here it is.

Background

I am working on an NLP text pipeline where we need to convert RTF to plain text. In other words, we need to remove the RTF control characters and leave the text content intact. We are building the pipeline in Python, and it has several requirements that prevent us from using something like Apache Tika in production.

I know that RTF can contain hex values such as \'a9 if the author of the document typed a non-ASCII character. I also know that the first sequence of control characters in the document specifies how to decode these hex values, e.g. \ansicpg1252. In this case, the presence of \ansicpg1252 at the beginning of the document means that \'a9 should be interpreted as Unicode code point U+00A9 (COPYRIGHT SIGN), as per the windows-1252 encoding.
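To make that concrete, here is a minimal sketch of decoding runs of \'hh escapes with a single codec. The helper name `decode_hex_escapes` is hypothetical, and it assumes one codec applies to the whole fragment, which is the simple case described above.

```python
import re

# One or more consecutive \'hh escapes (backslash, apostrophe, two hex digits).
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")

def decode_hex_escapes(rtf_fragment: str, codec: str = "cp1252") -> str:
    r"""Replace each run of \'hh escapes with the text it decodes to."""
    def _decode(match: re.Match) -> str:
        # Strip the \' prefixes, leaving only hex digits, then decode the bytes.
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codec)
    return HEX_RUN.sub(_decode, rtf_fragment)

print(decode_hex_escapes(r"Copyright \'a9 2020"))
```

Decoding a whole run at once, rather than one escape at a time, matters for multi-byte encodings where two escapes together form a single character.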

Question

I came across an RTF document with \ansicpg1252 in the first set of control characters. However, there are several places in the document where the following hex literals appear: \'81\'aa. This is confusing because 0x81 is undefined in the windows-1252 encoding. I thought maybe it could be UTF-8, but the byte sequence 0x81 0xAA is not valid UTF-8 either, since 0x81 cannot start a UTF-8 sequence.
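A quick check in Python reproduces the problem. This is only a sketch: `try_decode` is a hypothetical helper, and it relies on Python's strict `cp1252` codec, which treats 0x81 as undefined.

```python
def try_decode(raw: bytes, codec: str):
    """Return the decoded text, or None if raw is not valid in codec."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None

# The two bytes from the \'81\'aa escapes.
mystery = bytes([0x81, 0xAA])

for codec in ("cp1252", "utf-8"):
    print(codec, repr(try_decode(mystery, codec)))
```

Both attempts come back as None, which matches the observation that neither encoding explains these bytes.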

WordPad.exe represents these two bytes with this character: ↑

Apache Tika uses the same character, ↑

This character corresponds to Unicode code point U+2191 (UPWARDS ARROW), and as it turns out our mystery bytes, 0x81AA, are the result of encoding this character using Windows code page 932, a Shift JIS variant used for Japanese text.
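The round trip can be confirmed with Python's built-in `cp932` codec. This is just a verification sketch, not part of any parser:

```python
import unicodedata

arrow = "\u2191"

# Encode the arrow with code page 932 and show the resulting bytes as hex.
encoded = arrow.encode("cp932")
print(unicodedata.name(arrow), encoded.hex())

# Decode the bytes from the RTF escapes \'81\'aa with the same codec.
decoded = bytes.fromhex("81aa").decode("cp932")
print(decoded == arrow)
```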

For reference, the full context of those two bytes in the RTF document is

\plain\f1\fs20 \'81\'aa\plain\f0\fs20

and the document contains this entry in the \fonttbl group:

{\f1\fmodern\fcharset128\fprq1 MS Mincho;}

which, as far as I understand, means that any text following \f1 should be rendered using the MS Mincho font. That makes some sense, since MS Mincho contains Japanese glyphs. But how would an RTF parser know that 0x81AA should be decoded using Windows code page 932 and not code page 1252, as \ansicpg1252 in the first line of the file specifies? Do I need to know that certain fonts imply certain encodings?

My best guess is that it has something to do with the part of the \fonttbl entry that says \fcharset128, but I'm not sure.

Upvotes: 1

Views: 1325

Answers (1)

Jon Iles

Reputation: 2579

After posting a comment, I did a bit more digging...

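The \fcharset argument comes from a fixed set of values, each of which maps to the encoding used for text in that font. In your case, \fcharset128 is the Shift JIS charset, which corresponds to Windows code page 932. Here's an example of the mapping: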

https://github.com/joniles/rtfparserkit/blob/master/src/main/java/com/rtfparserkit/parser/standard/FontCharset.java
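For a Python pipeline, a rough equivalent of that mapping might look like the sketch below. It covers only a subset of the \fcharset values, the Python codec names are my assumption of the closest built-in equivalents for each Windows code page, and `codec_for_fcharset` is a hypothetical helper name.

```python
import codecs

# Subset of \fcharset values -> Python codec names (assumed equivalents
# of the corresponding Windows code pages).
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI
    128: "cp932",   # Shift JIS (Japanese)
    129: "cp949",   # Hangul (Korean)
    134: "cp936",   # GB2312 (Simplified Chinese)
    136: "cp950",   # Big5 (Traditional Chinese)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    163: "cp1258",  # Vietnamese
    177: "cp1255",  # Hebrew
    178: "cp1256",  # Arabic
    186: "cp1257",  # Baltic
    204: "cp1251",  # Russian
    222: "cp874",   # Thai
    238: "cp1250",  # Eastern European
}

def codec_for_fcharset(fcharset: int, default: str = "cp1252") -> str:
    """Return the codec for a \\fcharset value, or default if unknown."""
    return FCHARSET_TO_CODEC.get(fcharset, default)

print(codec_for_fcharset(128))
```

Falling back to the document-level code page for unknown values is a design choice of this sketch; a stricter parser might prefer to raise an error instead.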

From memory I think I picked these up from Microsoft's RTF spec document (https://www.microsoft.com/en-us/download/details.aspx?id=10725)
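Putting this together for the example in the question, a parser could read each font's \fcharset from the \fonttbl group and switch codecs whenever a \fN control word appears. The following is a deliberately simplified sketch: all names are hypothetical, the charset table holds just the two values from the question, it returns only the decoded hex escapes, and it ignores groups, \plain, and every other control word that a real RTF parser has to handle.

```python
import re

# Only the two charsets from the question; codec names are assumed equivalents.
FCHARSET_TO_CODEC = {0: "cp1252", 128: "cp932"}

# A font table entry: \fN ... \fcharsetM inside one {...;} entry.
FONT_ENTRY = re.compile(r"\\f(\d+)[^;{}]*?\\fcharset(\d+)")
# Either a font switch (\fN) or a run of \'hh escapes.
TOKEN = re.compile(r"\\f(\d+)|((?:\\'[0-9a-fA-F]{2})+)")

def font_codecs(fonttbl: str, default: str = "cp1252") -> dict:
    """Map each font number to a codec, based on its \\fcharset value."""
    return {int(font): FCHARSET_TO_CODEC.get(int(charset), default)
            for font, charset in FONT_ENTRY.findall(fonttbl)}

def decode_hex_runs(body: str, codecs_by_font: dict, default: str = "cp1252") -> str:
    """Decode \\'hh runs using the codec of the most recently selected font."""
    codec, out = default, []
    for match in TOKEN.finditer(body):
        if match.group(1) is not None:
            # Font switch: pick the codec registered for this font number.
            codec = codecs_by_font.get(int(match.group(1)), default)
        else:
            raw = bytes.fromhex(match.group(2).replace("\\'", ""))
            out.append(raw.decode(codec, errors="replace"))
    return "".join(out)

fonttbl = r"{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f1\fmodern\fcharset128\fprq1 MS Mincho;}}"
body = r"\plain\f1\fs20 \'81\'aa\plain\f0\fs20 \'a9"
print(decode_hex_runs(body, font_codecs(fonttbl)))
```

The \f0 entry and the trailing \'a9 are made up for illustration; only the \f1 entry and the \'81\'aa run come from the question.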

Upvotes: 1
