tm1701

Reputation: 7581

Specify UTF-8 character encoding in RTF? The text (in UTF-8 format) is shown correctly in SQLite

How can I specify the character encoding in RTF for text that is stored as UTF-8?

I studied similar questions, but did not find a good solution. So, I hope you can help.

The content is in a SQLite database. Text in a SQLite database can only be stored as UTF-8, UTF-16 or similar, which is why I have to stick to UTF-8.

The ë is shown correctly in a SQLite database browser.

The required target program, which can only read RTF, displays the characters in a strange way.
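
One plausible cause of the strange display, sketched here as an assumption rather than a diagnosis: the target program reads the raw UTF-8 bytes as a single-byte code page. The class name MojibakeDemo is hypothetical, and ISO-8859-1 stands in for the reader's code page only because Java guarantees that charset.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes those bytes as if they were ISO-8859-1. */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // e with diaeresis (U+00EB) is two bytes in UTF-8: 0xC3 0xAB.
        // A single-byte reader shows them as two separate characters.
        String shown = misread("\u00EB");
        System.out.println(shown.length()); // prints 2
    }
}
```

If this is what happens, no \ansicpg value in the header will help, because the bytes themselves have to be escaped before they are written into the RTF.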

I tried for example:

{\rtf1\ansi\ansicpg0\uc0...
{\rtf1\ansi\ansicpg1252\uc0...
{\rtf1\ansi\ansicpg65001\uc0...

One option is to map the special characters to their RTF character equivalents, as shown in this table.

Upvotes: 1

Views: 7969

Answers (2)

tm1701

Reputation: 7581

I read in many places that RTF has no standard way to embed UTF-8 text.

So, I created my own converter after scanning half the internet. If you have a standard/better solution, please let me know!

After studying this book, I created a converter based on these character mappings. Both are great resources.

This solved my problem. I would rather reuse an existing solution for this kind of feature, but alas I could not find one.

The converter could be something like:

public static String convertHtmlToRtf(String html) {
    String tmp = html.replaceAll("\\R", " ")
            .replaceAll("\\\\", "\\\\\\\\")
            .replaceAll("\\{", "\\\\{")
            .replaceAll("}", "\\\\}");
    tmp = tmp.replaceAll("<a\\s+target=\"_blank\"\\s+href=[\"']([^\"']+?)[\"']\\s*>([^<]+?)</a>",
            "{\\\\field{\\\\*\\\\fldinst HYPERLINK \"$1\"}{\\\\fldrslt \\\\plain \\\\f2\\\\b\\\\fs20\\\\cf2 $2}}");
    tmp = tmp.replaceAll("<a\\s+href=[\"']([^\"']+?)[\"']\\s*>([^<]+?)</a>",
            "{\\\\field{\\\\*\\\\fldinst HYPERLINK \"$1\"}{\\\\fldrslt \\\\plain \\\\f2\\\\b\\\\fs20\\\\cf2 $2}}");

    tmp = tmp.replaceAll("<h3>", "\\\\line{\\\\b\\\\fs30{");
    tmp = tmp.replaceAll("</h3>", "}}\\\\line\\\\line ");
    tmp = tmp.replaceAll("<b>", "{\\\\b{");
    tmp = tmp.replaceAll("</b>", "}}");
    tmp = tmp.replaceAll("<strong>", "{\\\\b{");
    tmp = tmp.replaceAll("</strong>", "}}");
    tmp = tmp.replaceAll("<i>", "{\\\\i{");
    tmp = tmp.replaceAll("</i>", "}}");
    tmp = tmp.replaceAll("&amp;", "&");
    tmp = tmp.replaceAll("&quot;", "\"");
    tmp = tmp.replaceAll("&copy;", "{\\\\'a9}");
    tmp = tmp.replaceAll("&lt;", "<");
    tmp = tmp.replaceAll("&gt;", ">");
    tmp = tmp.replaceAll("<br/?><br/?>", "{\\\\pard \\\\par}\\\\line ");
    tmp = tmp.replaceAll("<br/?>", "\\\\line ");
    tmp = tmp.replaceAll("<BR>", "\\\\line ");
    tmp = tmp.replaceAll("<p[^>]*?>", "{\\\\pard ");
    tmp = tmp.replaceAll("</p>", " \\\\par}\\\\line ");
    tmp = convertSpecialCharsToRtfCodes(tmp);
    return "{\\rtf1\\ansi\\ansicpg0\\uc0\\deff0\\deflang0\\deflangfe0\\fs20{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil Tahoma;}{\\f2\\fnil\\fcharset0 Tahoma;}}{\\colortbl;\\red0\\green0\\blue0;\\red0\\green0\\blue255;\\red0\\green255\\blue0;\\red255\\green0\\blue0;}" + tmp + "}";
}

private static String convertSpecialCharsToRtfCodes(String input) {
    char[] chars = input.toCharArray();
    StringBuilder sb = new StringBuilder();
    int length = chars.length;
    for (int i = 0; i < length; i++) {
        switch (chars[i]) {
            case '’':
                sb.append("{\\'92}");
                break;
            case '`':
                sb.append("{\\'60}");
                break;
            case '€':
                sb.append("{\\'80}");
                break;
            case '…':
                sb.append("{\\'85}");
                break;
            case '‘':
                sb.append("{\\'91}");
                break;
            case '̕':
                sb.append("{\\'92}");
                break;
            case '“':
                sb.append("{\\'93}");
                break;
            case '”':
                sb.append("{\\'94}");
                break;
            case '•':
                sb.append("{\\'95}");
                break;
            case '–':
            case '‒':
                sb.append("{\\'96}");
                break;
            case '—':
                sb.append("{\\'97}");
                break;
            case '©':
                sb.append("{\\'a9}");
                break;
            case '«':
                sb.append("{\\'ab}");
                break;
            case '±':
                sb.append("{\\'b1}");
                break;
            case '„':
                sb.append("\"");
                break;
            case '´':
                sb.append("{\\'b4}");
                break;
            case '¸':
                sb.append("{\\'b8}");
                break;
            case '»':
                sb.append("{\\'bb}");
                break;
            case '½':
                sb.append("{\\'bd}");
                break;
            case 'Ä':
                sb.append("{\\'c4}");
                break;
            case 'È':
                sb.append("{\\'c8}");
                break;
            case 'É':
                sb.append("{\\'c9}");
                break;
            case 'Ë':
                sb.append("{\\'cb}");
                break;
            case 'Ï':
                sb.append("{\\'cf}");
                break;
            case 'Í':
                sb.append("{\\'cd}");
                break;
            case 'Ó':
                sb.append("{\\'d3}");
                break;
            case 'Ö':
                sb.append("{\\'d6}");
                break;
            case 'Ü':
                sb.append("{\\'dc}");
                break;
            case 'Ú':
                sb.append("{\\'da}");
                break;
            case 'ß':
            case 'β':
                sb.append("{\\'df}");
                break;
            case 'à':
                sb.append("{\\'e0}");
                break;
            case 'á':
                sb.append("{\\'e1}");
                break;
            case 'ä':
                sb.append("{\\'e4}");
                break;
            case 'è':
                sb.append("{\\'e8}");
                break;
            case 'é':
                sb.append("{\\'e9}");
                break;
            case 'ê':
                sb.append("{\\'ea}");
                break;
            case 'ë':
                sb.append("{\\'eb}");
                break;
            case 'ï':
                sb.append("{\\'ef}");
                break;
            case 'í':
                sb.append("{\\'ed}");
                break;
            case 'ò':
                sb.append("{\\'f2}");
                break;
            case 'ó':
                sb.append("{\\'f3}");
                break;
            case 'ö':
                sb.append("{\\'f6}");
                break;
            case 'ú':
                sb.append("{\\'fa}");
                break;
            case 'ü':
                sb.append("{\\'fc}");
                break;
            default:
                if( chars[i] != ' ' && Character.isSpaceChar( chars[i])) {
                    System.out.print( ".");
                    //sb.append("{\\~}");
                    sb.append(" ");
                } else if( chars[i] == 8218) {
                    System.out.println("Strange comma ... ");
                    sb.append(",");
                } else if( chars[i] > 132) {
                    System.err.println( "Special code that is not translated in RTF: '" + chars[i] + "', number=" + (int) chars[i]);
                    sb.append(chars[i]);
                } else {
                    sb.append(chars[i]);
                }
        }
    }
    return sb.toString();
}
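
The \'xx codes in the switch above are Windows-1252 byte values, so the table could probably be derived instead of typed by hand. A minimal sketch, assuming the JDK ships the windows-1252 charset (it is not one of Java's guaranteed standard charsets) and using a hypothetical class name Cp1252RtfEscaper:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class Cp1252RtfEscaper {

    // Assumption: this charset is available in the running JDK.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Replaces non-ASCII characters that Windows-1252 can encode with {\'xx} groups. */
    static String escape(String input) {
        CharsetEncoder encoder = CP1252.newEncoder();
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                sb.append(c); // plain ASCII passes through untouched
            } else if (encoder.canEncode(c)) {
                int b = String.valueOf(c).getBytes(CP1252)[0] & 0xFF;
                sb.append(String.format("{\\'%02x}", b));
            } else {
                sb.append(c); // not in Windows-1252: left as-is, like the default branch above
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-acute, euro sign, right single quotation mark
        System.out.println(escape("caf\u00E9 \u20AC5 it\u2019s"));
    }
}
```

It does not re-escape backslashes or braces, because convertHtmlToRtf already does that before calling the helper. It also does not reproduce the hand-made exceptions in the switch, such as mapping the low double quote to a plain quote.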

Upvotes: 1

AmigoJack

Reputation: 6109

The site you mentioned links to Unicode in RTF:

If the character is between 255 and 32,768, express it as \uc1\u<number>*. For example, 可, character number 21,487, is \uc1\u21487* in RTF.

If the character is between 32,768 and 65,535, subtract 65,536 from it, and use the resulting negative number. For example, 道 is character 36,947, so we subtract 65,536 to get -28,589 and we have \uc1\u-28589* in RTF.

If the character is over 65,535, then we can’t express it in RTF.

Looks like RTF doesn't know UTF-8 at all, only Unicode in general. Other answers for Java and C# just use the \u directly.
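
The quoted rules can be sketched in Java roughly as follows. This is an illustration under assumptions, not a complete RTF writer: the class name RtfUnicodeEscaper is hypothetical, and every character above 127 is written in the \uc1\u form, including those below 255 that the quote does not cover. A Java char is a 16-bit UTF-16 unit, so casting it to short yields the negative number for values of 32,768 and above.

```java
final class RtfUnicodeEscaper {

    /** Escapes RTF syntax characters and writes non-ASCII chars as Unicode control words. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c); // RTF syntax characters need a backslash
            } else if (c < 128) {
                sb.append(c); // plain ASCII is written as-is
            } else {
                // (short) reinterprets the 16-bit value as signed, which is the
                // same as subtracting 65536 when the value is 32768 or more.
                // The trailing '*' is the one fallback character announced by uc1.
                sb.append("\\uc1\\u").append((int) (short) c).append('*');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Code points 21487 (U+53EF) and 36947 (U+9053) from the quoted examples
        System.out.println(escape("\u53EF")); // prints \uc1\u21487*
        System.out.println(escape("\u9053")); // prints \uc1\u-28589*
    }
}
```

Emitting \uc1 before every character mirrors the quoted examples; declaring it once in the header would presumably be enough. A character above 65,535 reaches this loop as two surrogate chars and is written as two control words. Whether a given reader recombines them is an assumption to verify.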

Upvotes: 3
