darkhie

Reputation: 263

Parsing Arabic/Chinese/Japanese RSS feeds with Java

I'm writing an RSS feed parser in Java and I've run into a problem parsing feeds that contain Arabic/Chinese/Japanese characters. Example feed

When I print them I just get sets of question marks "?????? ?? ????? ??".

They also end up like that in my database (MySQL, connected through Hibernate, with utf8 set as the encoding).

This is the part of the code that is responsible for getting the title from a feed:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();

Document doc = db.parse(url.openStream());
doc.getDocumentElement().normalize();

Node channelNode = doc.getElementsByTagName("channel").item(0);

NodeList channelList = channelNode.getChildNodes();

for (int i = 0; i < channelList.getLength(); i++) {
    Node element = channelList.item(i);

    String name = element.getNodeName();

    if (name.equalsIgnoreCase("title")) {
        rssName = element.getTextContent();
        break;
    }
}

How do I get the proper characters into the database? When I copy them from the feed and insert them manually into the DB, they are stored correctly.

Thanks

UPDATE:
Adding these lines to my Hibernate config fixed the issue:

<property name="hibernate.connection.useUnicode">true</property>  
<property name="hibernate.connection.characterEncoding">UTF-8</property>

Upvotes: 1

Views: 1467

Answers (3)

BalusC

Reputation: 1109302

You need to change the MySQL JDBC URL in Hibernate configuration to include the following params:

jdbc:mysql://hostname:3306/db_name?useUnicode=yes&characterEncoding=UTF-8

Otherwise the MySQL JDBC driver will use the client platform default encoding.
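For plain JDBC code outside Hibernate, the same two settings can be passed as connection properties instead of URL parameters. The following is only a sketch: the class name, method names, user and password are made up for illustration, and open() is never called here because it needs a reachable MySQL server and the driver on the classpath.

```java
import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

class Utf8ConnectionSketch {
    // Builds connection properties that mirror the useUnicode and
    // characterEncoding URL parameters. User and password are placeholders.
    static Properties utf8Properties(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("useUnicode", "true");
        props.setProperty("characterEncoding", "UTF-8");
        return props;
    }

    // Hypothetical helper, not invoked in main: requires a running MySQL
    // server and the MySQL JDBC driver on the classpath.
    static Connection open(String url, Properties props) throws SQLException {
        return DriverManager.getConnection(url, props);
    }

    public static void main(String[] args) {
        // The platform default is what the driver falls back to when no
        // encoding is configured, so it is worth checking on each machine.
        System.out.println("Platform default charset: " + Charset.defaultCharset());
        Properties props = utf8Properties("feed_user", "secret");
        System.out.println("characterEncoding = " + props.getProperty("characterEncoding"));
    }
}
```

Printing the platform default charset on the machine that runs the parser is a quick way to see which encoding is being used when nothing is configured.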

Your DB encoding is fine, since manual inserts apparently work. XML is parsed as UTF-8 by default, so that part is fine as well, unless the XML declaration explicitly specifies a different encoding. That is unlikely here, since it would be a mistake on the part of the RSS feed server.
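To rule out the parsing side, a small self-contained check like the one below can help. It is a sketch, not the asker's code: the class name, the in-memory feed and the Arabic title are invented for the test. The parser is handed raw bytes, so it picks the encoding from the XML declaration, and the result is compared in memory rather than printed, because the console itself may not be able to display the characters.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class FeedTitleDemo {
    // Returns the text of the first <title> child of <channel>, or null.
    static String channelTitle(InputStream in) {
        try {
            // Raw bytes go in; the parser decodes them according to the
            // encoding named in the XML declaration.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            Node channel = doc.getElementsByTagName("channel").item(0);
            if (channel == null) {
                return null;
            }
            NodeList children = channel.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if ("title".equalsIgnoreCase(child.getNodeName())) {
                    return child.getTextContent();
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Arabic sample title written as Unicode escapes, so the source
        // file encoding does not matter.
        String title = "\u0623\u062E\u0628\u0627\u0631";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss version=\"2.0\"><channel><title>" + title
                + "</title></channel></rss>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // Compare in memory instead of trusting what the console shows.
        System.out.println("title survived parsing: " + title.equals(channelTitle(in)));
    }
}
```

If the comparison holds but the console still shows question marks, the characters are being lost on output or on the way to the database, not in the parser.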

Upvotes: 3

Olivier Croisier

Reputation: 6149

It's clearly an encoding problem. You should try to decode the RSS stream using the UTF-8 charset.

Upvotes: -1

Kris

Reputation: 14468

You need to ensure that the character encoding of the database is compatible with such characters, most likely by configuring it to be UTF-8.

If the database character encoding cannot handle a character, it gets converted to ?.
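The same kind of substitution can be reproduced in plain Java, without a database. This is only an analogy for what happens at the database or driver level, and the class and method names are made up for illustration: text is encoded into a charset and decoded back, and any character the charset cannot represent comes back as ?.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    // Encodes the text into the given charset and decodes it again,
    // mimicking a trip through a layer that only speaks that charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as Unicode escapes.
        String title = "\u0645\u0631\u062D\u0628\u0627";

        // ISO-8859-1 has no Arabic letters, so each one is replaced.
        System.out.println(roundTrip(title, StandardCharsets.ISO_8859_1)); // prints ?????

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(title, StandardCharsets.UTF_8).equals(title)); // prints true
    }
}
```

Once the characters have been replaced, the original text cannot be recovered from the stored value, which is why the encoding has to be right at every hop.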

Most databases will have an overall default encoding and then allow per table and per column overrides.

You will also need to ensure that you are parsing the incoming stream correctly (i.e. as UTF-8 or whatever encoding it specifies).
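If the encoding is known from somewhere other than the XML declaration (for example an HTTP Content-Type header, which is an assumption and not something the question mentions), one option is to decode the bytes yourself and hand the parser a Reader. A minimal sketch with made-up class and method names:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ExplicitCharsetParse {
    // Decodes the stream with the given charset and parses the resulting
    // characters. The parser receives a character stream, so the caller's
    // charset choice decides how the bytes are interpreted.
    static Document parse(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new InputSource(reader));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Japanese sample title written as Unicode escapes.
        String title = "\u65E5\u672C\u8A9E";
        String xml = "<rss version=\"2.0\"><channel><title>" + title
                + "</title></channel></rss>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        Document doc = parse(in, StandardCharsets.UTF_8);
        String parsed = doc.getElementsByTagName("title").item(0).getTextContent();
        System.out.println("title survived parsing: " + title.equals(parsed));
    }
}
```

Passing the wrong charset here silently corrupts the text, so this approach only makes sense when the encoding really is known; otherwise handing the parser the raw stream is the safer default.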

Upvotes: 0
