Reputation: 263
I'm writing an RSS feed parser in Java and have run into a problem parsing feeds that contain Arabic, Chinese, or Japanese characters. Example feed
When I print them, I just get runs of question marks: "?????? ?? ????? ??".
They end up in my database (MySQL, accessed through Hibernate, with utf8 set as the encoding) the same way.
This is the part of the code responsible for getting the title from a feed:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(url.openStream());
doc.getDocumentElement().normalize();
Node channelNode = doc.getElementsByTagName("channel").item(0);
NodeList channelList = channelNode.getChildNodes();
for (int i = 0; i < channelList.getLength(); i++) {
    Node element = channelList.item(i);
    String name = element.getNodeName();
    if (name.equalsIgnoreCase("title")) {
        rssName = element.getTextContent();
        break;
    }
}
How do I get the proper characters into the database? When I copy them from the feed and insert them manually into the DB, it's OK.
Thanks
UPDATE:
Adding these lines to my Hibernate config fixed the issue:
<property name="hibernate.connection.useUnicode">true</property>
<property name="hibernate.connection.characterEncoding">UTF-8</property>
Upvotes: 1
Views: 1467
Reputation: 1109302
You need to change the MySQL JDBC URL in the Hibernate configuration to include the following parameters:
jdbc:mysql://hostname:3306/db_name?useUnicode=yes&characterEncoding=UTF-8
Otherwise the MySQL JDBC driver will use the client platform default encoding.
Your DB encoding is fine, since manual inserts apparently work. XML is parsed as UTF-8 by default, so that part is fine as well (unless the XML declaration explicitly specifies otherwise, which is unlikely here since that would be a mistake on the part of the RSS feed server).
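For illustration only, here is a small sketch of assembling that URL in Java. The class and method names (MysqlUrlSketch, utf8Url) are hypothetical, and hostname, 3306 and db_name are just the placeholders from the URL above. Note that if the URL goes into an XML Hibernate config file instead, the & between the parameters has to be written as &amp;.

```java
class MysqlUrlSketch {
    // Builds a MySQL JDBC URL carrying the two encoding parameters from the answer.
    // Hypothetical helper; it does not open a connection.
    static String utf8Url(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // Placeholder values taken from the example URL, not a real server.
        System.out.println(utf8Url("hostname", 3306, "db_name"));
    }
}
```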
Upvotes: 3
Reputation: 6149
It's clearly an encoding problem. You should try to decode the RSS stream using the UTF-8 charset.
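A minimal sketch of what decoding the stream as UTF-8 could look like, assuming the feed really is UTF-8 (forcing it would be wrong for a feed that declares another encoding). The class and method names are made up for the example, and the sample XML is an inline stand-in for the real feed. Wrapping the bytes in an InputStreamReader means the parser receives characters that are already decoded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class Utf8FeedParser {
    // Reads the stream as UTF-8 text and returns the first <title> element's content.
    static String firstTitle(InputStream in) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Decode the bytes ourselves instead of relying on parser autodetection.
            InputSource source = new InputSource(new InputStreamReader(in, StandardCharsets.UTF_8));
            Document doc = db.parse(source);
            return doc.getElementsByTagName("title").item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Arabic sample title written with Unicode escapes so the source file stays ASCII.
        String expected = "\u0645\u0631\u062D\u0628\u0627";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss><channel><title>" + expected + "</title></channel></rss>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstTitle(in).equals(expected));
    }
}
```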
Upvotes: -1
Reputation: 14468
You need to ensure that the character encoding of the database is compatible with such characters, most likely by configuring it to be UTF-8.
If the database character encoding cannot handle a character, it gets converted to ?.
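The same kind of substitution can be reproduced in plain Java, without a database. This is only an analogy for what happens when text passes through an encoding that cannot represent it; the class name and the sample string are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingLossDemo {
    // Encodes the text to bytes with the given charset, then decodes it back.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as Unicode escapes so the source file stays ASCII.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        // UTF-8 can represent every character, so the text survives the round trip.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic)); // true
        // ISO-8859-1 has no Arabic letters; each one is replaced with '?'.
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1)); // ?????
    }
}
```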
Most databases will have an overall default encoding and then allow per table and per column overrides.
You will also need to ensure that you are parsing the incoming stream correctly (i.e. as UTF-8 or whatever encoding it specifies).
Upvotes: 0