Reputation: 7581
How can I set the character encoding in RTF of characters that are in the UTF-8 character encoding format?
I studied similar questions but did not find a good solution, so I hope you can help.
The content is in a SQLite database. Text in a SQLite database can only be stored as UTF-8, UTF-16 or similar, which is why I have to stick to UTF-8.
Accented characters (for example ë) are shown correctly in a SQLite database browser.
The required target program, which can only read RTF, displays the characters in a strange way.
I tried for example:
{\rtf1\ansi\ansicpg0\uc0...
{\rtf1\ansi\ansicpg1252\uc0...
{\rtf1\ansi\ansicpg65001\uc0...
One option is to map the special characters to their RTF character equivalents, as shown in this table.
Upvotes: 1
Views: 7969
Reputation: 7581
I read in many places that RTF has no standard way of handling UTF-8.
So, I created my own converter after scanning half the internet. If you have a standard/better solution, please let me know!
After studying this book, I created a converter based on these character mappings. Both are great resources.
This solved my problem. I would prefer to reuse an existing solution for this kind of feature, but alas, I was not able to find one.
The converter could be something like:
public static String convertHtmlToRtf(String html) {
String tmp = html.replaceAll("\\R", " ")
.replaceAll("\\\\", "\\\\\\\\")
.replaceAll("\\{", "\\\\{")
.replaceAll("}", "\\\\}");
tmp = tmp.replaceAll("<a\\s+target=\"_blank\"\\s+href=[\"']([^\"']+?)[\"']\\s*>([^<]+?)</a>",
"{\\\\field{\\\\*\\\\fldinst HYPERLINK \"$1\"}{\\\\fldrslt \\\\plain \\\\f2\\\\b\\\\fs20\\\\cf2 $2}}");
tmp = tmp.replaceAll("<a\\s+href=[\"']([^\"']+?)[\"']\\s*>([^<]+?)</a>",
"{\\\\field{\\\\*\\\\fldinst HYPERLINK \"$1\"}{\\\\fldrslt \\\\plain \\\\f2\\\\b\\\\fs20\\\\cf2 $2}}");
tmp = tmp.replaceAll("<h3>", "\\\\line{\\\\b\\\\fs30{");
tmp = tmp.replaceAll("</h3>", "}}\\\\line\\\\line ");
tmp = tmp.replaceAll("<b>", "{\\\\b{");
tmp = tmp.replaceAll("</b>", "}}");
tmp = tmp.replaceAll("<strong>", "{\\\\b{");
tmp = tmp.replaceAll("</strong>", "}}");
tmp = tmp.replaceAll("<i>", "{\\\\i{");
tmp = tmp.replaceAll("</i>", "}}");
tmp = tmp.replaceAll("&amp;", "&");
tmp = tmp.replaceAll("&quot;", "\"");
tmp = tmp.replaceAll("&copy;", "{\\\\'a9}");
tmp = tmp.replaceAll("&lt;", "<");
tmp = tmp.replaceAll("&gt;", ">");
tmp = tmp.replaceAll("<br/?><br/?>", "{\\\\pard \\\\par}\\\\line ");
tmp = tmp.replaceAll("<br/?>", "\\\\line ");
tmp = tmp.replaceAll("<BR>", "\\\\line ");
tmp = tmp.replaceAll("<p[^>]*?>", "{\\\\pard ");
tmp = tmp.replaceAll("</p>", " \\\\par}\\\\line ");
tmp = convertSpecialCharsToRtfCodes(tmp);
return "{\\rtf1\\ansi\\ansicpg0\\uc0\\deff0\\deflang0\\deflangfe0\\fs20{\\fonttbl{\\f0\\fnil Tahoma;}{\\f1\\fnil Tahoma;}{\\f2\\fnil\\fcharset0 Tahoma;}}{\\colortbl;\\red0\\green0\\blue0;\\red0\\green0\\blue255;\\red0\\green255\\blue0;\\red255\\green0\\blue0;}" + tmp + "}";
}
private static String convertSpecialCharsToRtfCodes(String input) {
char[] chars = input.toCharArray();
StringBuilder sb = new StringBuilder();
int length = chars.length;
for (int i = 0; i < length; i++) {
switch (chars[i]) {
case '’':
sb.append("{\\'92}");
break;
case '`':
sb.append("{\\'60}");
break;
case '€':
sb.append("{\\'80}");
break;
case '…':
sb.append("{\\'85}");
break;
case '‘':
sb.append("{\\'91}");
break;
case '̕':
sb.append("{\\'92}");
break;
case '“':
sb.append("{\\'93}");
break;
case '”':
sb.append("{\\'94}");
break;
case '•':
sb.append("{\\'95}");
break;
case '–':
case '‒':
sb.append("{\\'96}");
break;
case '—':
sb.append("{\\'97}");
break;
case '©':
sb.append("{\\'a9}");
break;
case '«':
sb.append("{\\'ab}");
break;
case '±':
sb.append("{\\'b1}");
break;
case '„':
sb.append("\"");
break;
case '´':
sb.append("{\\'b4}");
break;
case '¸':
sb.append("{\\'b8}");
break;
case '»':
sb.append("{\\'bb}");
break;
case '½':
sb.append("{\\'bd}");
break;
case 'Ä':
sb.append("{\\'c4}");
break;
case 'È':
sb.append("{\\'c8}");
break;
case 'É':
sb.append("{\\'c9}");
break;
case 'Ë':
sb.append("{\\'cb}");
break;
case 'Ï':
sb.append("{\\'cf}");
break;
case 'Í':
sb.append("{\\'cd}");
break;
case 'Ó':
sb.append("{\\'d3}");
break;
case 'Ö':
sb.append("{\\'d6}");
break;
case 'Ü':
sb.append("{\\'dc}");
break;
case 'Ú':
sb.append("{\\'da}");
break;
case 'ß':
case 'β':
sb.append("{\\'df}");
break;
case 'à':
sb.append("{\\'e0}");
break;
case 'á':
sb.append("{\\'e1}");
break;
case 'ä':
sb.append("{\\'e4}");
break;
case 'è':
sb.append("{\\'e8}");
break;
case 'é':
sb.append("{\\'e9}");
break;
case 'ê':
sb.append("{\\'ea}");
break;
case 'ë':
sb.append("{\\'eb}");
break;
case 'ï':
sb.append("{\\'ef}");
break;
case 'í':
sb.append("{\\'ed}");
break;
case 'ò':
sb.append("{\\'f2}");
break;
case 'ó':
sb.append("{\\'f3}");
break;
case 'ö':
sb.append("{\\'f6}");
break;
case 'ú':
sb.append("{\\'fa}");
break;
case 'ü':
sb.append("{\\'fc}");
break;
default:
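if (chars[i] != ' ' && Character.isSpaceChar(chars[i])) {
// Any other Unicode space character (e.g. a non-breaking space) becomes a plain space.
System.out.print(".");
//sb.append("{\\~}");
sb.append(" ");
} else if (chars[i] == 8218) {
// Code point 8218 is the single low-9 quotation mark, which looks like a comma.
System.out.println("Strange comma ... ");
sb.append(",");
} else if (chars[i] > 132) {
// No mapping known: report the character and pass it through unchanged.
System.err.println("Special code that is not translated in RTF: '" + chars[i] + "', number=" + (int) chars[i]);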
sb.append(chars[i]);
} else {
sb.append(chars[i]);
}
}
}
return sb.toString();
}
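As a possible alternative to the hand-written switch, the same `{\'xx}` codes can be derived from Java's windows-1252 charset instead of being listed one by one. This is only a sketch, not what the converter above does: the class name `Cp1252RtfEscaper` and the method `escape` are made up for illustration, and it assumes the RTF header declares `\ansicpg1252` and that `\`, `{` and `}` in the text were already escaped earlier, as in `convertHtmlToRtf`. Characters that have no windows-1252 byte are passed through unchanged, like the default branch above.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class Cp1252RtfEscaper {

    // Assumption: the target RTF header declares \ansicpg1252.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Hypothetical helper: replaces non-ASCII characters by {\'xx} codes. */
    static String escape(String input) {
        CharsetEncoder encoder = CP1252.newEncoder();
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                // Plain ASCII needs no translation here.
                sb.append(c);
            } else if (encoder.canEncode(c)) {
                // Look up the single windows-1252 byte and write it as two hex digits.
                byte b = String.valueOf(c).getBytes(CP1252)[0];
                sb.append(String.format("{\\'%02x}", b & 0xff));
            } else {
                // No windows-1252 equivalent: leave the character untouched.
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is e-acute, U+20AC is the euro sign.
        System.out.println(escape("caf\u00e9 \u20ac"));
    }
}
```

With this approach the mapping table lives in the JDK, so characters such as ñ or ç that the switch does not list are covered as well, as long as they exist in windows-1252.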
Upvotes: 1
Reputation: 6109
The site you mentioned links to Unicode in RTF:
If the character is between 255 and 32,768, express it as \uc1\unumber*. For example, 可, character number 21,487, is \uc1\u21487* in RTF.
If the character is between 32,768 and 65,535, subtract 65,536 from it, and use the resulting negative number. For example, 道 is character 36,947, so we subtract 65,536 to get -28,589 and we have \uc1\u-28589* in RTF.
If the character is over 65,535, then we can't express it in RTF.
It looks like RTF doesn't know UTF-8 at all, only Unicode in general. Other answers for Java and C# just use the \u escape directly.
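The rules quoted above could be sketched in Java roughly as follows. The class name `RtfUnicodeEscaper` and the method `escape` are made up for illustration. Since a Java `char` is a 16-bit value, casting it to `short` gives exactly the "subtract 65,536" result for values of 32,768 and above. The sketch writes `\uc1` and a `*` fallback for every character, as in the quoted text, and it also applies the escape to characters between 128 and 255. Characters above 65,535 arrive as two surrogate `char` values and are written as two escapes; whether your target program accepts that is something I have not verified.

```java
final class RtfUnicodeEscaper {

    /** Hypothetical helper: escapes text for use in the body of an RTF document. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                // These three characters have a special meaning in RTF.
                sb.append('\\').append(c);
            } else if (c < 128) {
                // Plain ASCII passes through unchanged.
                sb.append(c);
            } else {
                // The short cast turns 36,947 into -28,589, and leaves
                // values below 32,768 as they are.
                int signed = (short) c;
                sb.append("\\uc1\\u").append(signed).append('*');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+53EF and U+9053 are the two example characters from the quote.
        System.out.println(escape("caf\u00e9 \u53ef \u9053"));
    }
}
```

If the RTF header already sets `\uc1` (or leaves it at its default), the repeated `\uc1` can probably be dropped. With `\uc0`, as in the headers tried in the question, the fallback character after the number would have to be omitted instead.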
Upvotes: 3