Chrishan

Reputation: 4106

Read unicode characters in XML in Java/Android

I am trying to parse XML output that contains Unicode characters. I can't read the complete string inside a tag, only a single character of it.

Here is my XML output:

    <item>
        <id>1</id>
        <name>&#x0DBD;&#x0DDC;&#x0DBD;&#x0DCA;</name>
        <cost>155</cost>
        <description>&#x0DBD;&#x0DDC;</description>
    </item>

This is the Java code I use to parse the XML string:

    public Document getDomElement(String xml) {
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();

            // Wrap the in-memory string so the parser can read it
            InputSource is = new InputSource();
            is.setEncoding("UTF-16");
            is.setCharacterStream(new StringReader(xml));
            doc = db.parse(is);
        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }
        // return DOM
        return doc;
    }

When I use normal English characters it gives the complete string.

Upvotes: 0

Views: 3093

Answers (3)

Chrishan

Reputation: 4106

This is the code I used to solve my problem.

    NodeList idlist = doc.getElementsByTagName(KEY_ID);
    NodeList namelist = doc.getElementsByTagName(KEY_NAME);
    NodeList costlist = doc.getElementsByTagName(KEY_COST);
    NodeList desclist = doc.getElementsByTagName(KEY_DESC);
    for (int i = 0; i < idlist.getLength(); i++)
    {
        Item item = new Item();
        item.setCost(costlist.item(i).getTextContent());
        item.setDescription(desclist.item(i).getTextContent());
        item.setName(namelist.item(i).getTextContent());
        itemarray.add(item);
    }
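
The question does not show how the values were originally read, so the following is only a guess at why `getTextContent()` helps. A common pattern is `getFirstChild().getNodeValue()`, which returns just the first text node; if a parser splits the text at each character reference, that yields a single character. Below is a minimal sketch comparing `getTextContent()` with manually joining all text children. The class name `TextContentDemo` and the helper `joinTextChildren` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TextContentDemo {

    // Hypothetical helper: concatenates every text/CDATA child of an element,
    // instead of reading only the first child.
    static String joinTextChildren(Node element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Parses an in-memory XML string; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<item><id>1</id>"
                + "<name>&#x0DBD;&#x0DDC;&#x0DBD;&#x0DCA;</name></item>";
        Node name = parse(xml).getElementsByTagName("name").item(0);

        String viaTextContent = name.getTextContent();
        String viaChildren = joinTextChildren(name);

        // Both approaches should see all four characters of the name.
        System.out.println(viaTextContent.length());
        System.out.println(viaTextContent.equals(viaChildren));
    }
}
```

Either approach reads the whole element regardless of how many text nodes the parser produced, which is why it is safer than looking at the first child only.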

Upvotes: 0

helios

Reputation: 13841

I've tried your code and there's no problem. If I evaluate the nodes with non-English characters, they exist and have the correct number of characters. They're not printable because I don't have those glyphs in the font I'm using, but value.codePointAt(i) returns the correct code point.

    NodeList list = doc.getDocumentElement().getChildNodes();
    for (int i=0; i<list.getLength(); i++)
    {
        String value = list.item(i).getTextContent();
        for (int j=0; j<value.length(); j++)
            System.out.print(" " + value.codePointAt(j));
        System.out.println();
    }

outputs:

 49
 3517 3548 3517 3530
 49 53 53
 3517 3548

These correspond to the decimal values of your code points.

I created the XML string by hand. You already have it in memory, right?

Upvotes: 1

mauhiz

Reputation: 491

  • By "Unicode" people usually mean UTF-8, but you are specifying UTF-16, which is unlikely to match the encoding of your XML

  • XML declares its own encoding in its header (the XML declaration), so you should not need to override it
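
To illustrate the second point, here is a rough sketch that hands the parser raw bytes and lets the XML declaration decide the encoding. Note that the `InputSource` documentation says `setEncoding` has no effect once a character stream (such as a `StringReader`) is supplied, so the `"UTF-16"` call in the question is most likely ignored anyway. The class name `DeclaredEncodingDemo` and the method `firstNameFromBytes` are invented for this example, and the sample document is hand-written.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DeclaredEncodingDemo {

    // Hypothetical helper: parses raw bytes without overriding the encoding,
    // so the parser relies on the XML declaration inside the document.
    static String firstNameFromBytes(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getElementsByTagName("name").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Same Sinhala text as in the question, written as literal characters
        // and serialized as UTF-8 to match the declaration.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<item><name>\u0DBD\u0DDC\u0DBD\u0DCA</name></item>";
        String name = firstNameFromBytes(xml.getBytes(StandardCharsets.UTF_8));

        // Print the decimal code points, as in the answer above.
        for (int i = 0; i < name.length(); i++) {
            System.out.print(" " + name.codePointAt(i));
        }
        System.out.println();
    }
}
```

If the XML is already a Java `String`, as in the question, it has been decoded and the encoding setting no longer matters; this only applies when you parse bytes from a file or network response.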

Upvotes: 0
