Tangoo

Reputation: 1449

How to create XML string using dom4j with a title consisting of Chinese characters using GBK encoding?

I want to generate an XML string using dom4j. The code is quite simple:

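import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;

Document document = DocumentHelper.createDocument();
// Sets the encoding named in the XML declaration
document.setXMLEncoding("GBK");

// Build <rss version="2.0"><channel><title>...</title></channel></rss>
Element rss = document.addElement("rss");
rss.addAttribute("version", "2.0");
Element channel = rss.addElement("channel");
Element title = channel.addElement("title");
title.setText("中文");

System.out.println(document.asXML());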

It prints the following:

<?xml version="1.0" encoding="GBK"?>
<rss version="2.0"><channel><title>????</title></channel></rss>

I can't figure out why the title comes out as <title>????</title>. What should I do?

I searched extensively before asking.

Upvotes: 1

Views: 110

Answers (1)

Michael Gantman

Reputation: 7808

You may have one of two issues here:

  1. The data was written correctly, but the editor or console that displays it cannot render those symbols
  2. The data is lost
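To make the second case concrete, here is a minimal sketch (JDK only, not dom4j) of how characters turn into question marks when text passes through a charset that cannot represent them. The class and method names are hypothetical, and ISO-8859-1 merely stands in for any charset that lacks Chinese characters; it is not a claim about where the loss happens in the question's setup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {

    // Encodes the text to bytes with the given charset, then decodes it again.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The string from the question, written as escapes so that the
        // source file's own encoding cannot interfere with the demo.
        String chinese = "\u4e2d\u6587";

        // UTF-8 can represent both characters, so nothing is lost.
        String viaUtf8 = roundTrip(chinese, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(chinese));

        // ISO-8859-1 cannot, so getBytes substitutes '?' for each character.
        String viaLatin1 = roundTrip(chinese, StandardCharsets.ISO_8859_1);
        System.out.println(viaLatin1);
    }
}
```

Once the substitution has happened, the original characters cannot be recovered from the result, which is why it matters to tell this case apart from a pure display problem.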

So you need to find out which one it is. When I dealt with similar issues, I used a tool that I wrote myself and published as an open-source library. The utility converts symbols into Unicode sequences and vice versa. Here is a small example:

import com.mgnt.utils.StringUnicodeEncoderDecoder;

String testStr1 = "中文";
// Convert each character to its Unicode sequence, then back again
String encoded1 = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(testStr1);
String restored = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encoded1);
System.out.println(testStr1 + "\n" + encoded1 + "\n" + restored);

The output of this code is:

中文
\u4e2d\u6587
中文
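If adding a dependency is not an option, roughly the same diagnostic can be sketched with the JDK alone. The class and method below are hypothetical helpers, not part of MgntUtils, and they cover only the encoding direction.

```java
class UnicodeEscapes {

    // Renders every UTF-16 code unit of the input as a backslash-u
    // escape with four lowercase hex digits.
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two Chinese characters from the question, written as escapes
        // so the result does not depend on the source file encoding.
        System.out.println(toUnicodeEscapes("\u4e2d\u6587"));

        // Four literal question marks, as seen in the question's output.
        System.out.println(toUnicodeEscapes("????"));
    }
}
```

The first call prints \u4e2d\u6587 and the second prints \u003f\u003f\u003f\u003f. Because the output is pure ASCII, it reads the same on any console, whatever encoding the console uses.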

So here is what I would do: copy the question marks from your XML and encode them into Unicode sequences. If you see codes similar to the ones above, then you have a display issue and the content is correct. If you see something like \u003f\u003f\u003f\u003f (\u003f is the code for '?'), then the data was lost.

There are different ways to deal with the problem, but one quick workaround is to use my utility to convert all the Chinese strings into Unicode sequences and then change the format from \u4e2d\u6587 to XML numeric character references such as &#x4e2d;&#x6587; (see Unicode in XML and other Markup Languages). The last conversion you will have to write on your own.

To summarize: you can use my utility to diagnose the problem and, if you wish, to fix it as well. The open-source library (written and maintained by me) is called MgntUtils. It is available as a Maven artifact and on GitHub, with source code and Javadoc included. There is also Javadoc for the StringUnicodeEncoderDecoder class.
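XML itself spells a code point as a numeric character reference such as &#x4e2d;, so the hand-written conversion step could look roughly like the sketch below. It works from the original string instead of the \uXXXX text, uses only the JDK, and the class and method names are hypothetical.

```java
class NumericCharRefs {

    // Replaces every non-ASCII code point with an XML numeric character
    // reference in hexadecimal form; ASCII characters are copied unchanged.
    // Note: this does not escape '&', '<' or '>' in the input.
    static String toNumericCharRefs(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The title text from the question, written as escapes.
        String title = toNumericCharRefs("\u4e2d\u6587");
        System.out.println("<title>" + title + "</title>");
    }
}
```

This prints <title>&#x4e2d;&#x6587;</title>, which any XML parser reads back as the two Chinese characters regardless of the declared encoding. The sketch is meant for markup assembled as plain text: a serializer that escapes '&' on its own would escape the references a second time.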

Upvotes: 0
