Karim M. El Tel

Reputation: 438

Parsing a UTF-8 Encoded XML File

I have an XML file, retrieved from a URL, that contains some Arabic characters, so I had to encode it in UTF-8 to handle them.

XML File:

<Entry>

    <lstItems>            
           <item>
        <id>1</id>
            <title>News Test 1</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/news1.jpg</img>
           </item>
           <item>
        <id>2</id>
            <title>كريم</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/news2.jpg</img>
           </item>
           <item>
        <id>3</id>
            <title>News Test 333</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/news3.jpg</img>
           </item> 
           <item>
        <id>4</id>
            <title>ربيع</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/cont20.jpg</img>
           </item> 
           <item>
        <id>5</id>
            <title>News Test 55555</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/cont21.jpg</img>
           </item>      
           <item>
        <id>6</id>
            <title>News Test 666666</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/cont22.jpg</img>
           </item>               
    </lstItems>
  </Entry>

I read the XML retrieved from the URL into a String as shown below:

public String getXmlFromUrl(String url) {

    try {
        return new AsyncTask<String, Void, String>() {
            @Override
            protected String doInBackground(String... params) {
                //String xml = null;
                try {
                    DefaultHttpClient httpClient = new DefaultHttpClient();
                    HttpGet httpPost = new HttpGet(params[0]);
                    HttpResponse httpResponse = httpClient.execute(httpPost);
                    HttpEntity httpEntity = httpResponse.getEntity();
                    xml = new String(EntityUtils.toString(httpEntity).getBytes(),"UTF-8");
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return xml;
            }
        }.execute(url).get();
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ExecutionException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return xml;
}

Now the returned String is passed to this method to get a Document for later use as shown below:

public Document getDomElement(String xml){

        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

        try {

            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource is = new InputSource();
            StringReader xmlstring=new StringReader(xml);
            is.setCharacterStream(xmlstring);
            is.setEncoding("UTF-8");
            // Code stops here!
            doc = db.parse(is);


        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }
        // return DOM
        return doc;

}

An error occurred with this message:

09-18 07:51:40.441: E/Error:(1210): Unexpected token (position:TEXT @1:4 in java.io.StringReader@4144c240) 

So the code crashes at the line marked above with the following error:

09-18 07:51:40.451: E/AndroidRuntime(1210): java.lang.RuntimeException: Unable to start activity ComponentInfo{com.example.university1/com.example.university1.MainActivity}: java.lang.NullPointerException

Kindly note that the code works fine with ISO encoding.

Upvotes: 0

Views: 4971

Answers (2)

artbristol

Reputation: 32407

This might not be the problem, but EntityUtils.toString(httpEntity).getBytes() is using the default platform encoding. You should use EntityUtils.toString(httpEntity) as the String, no need to turn it into bytes.
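To illustrate why the round trip through getBytes() cannot help, here is a minimal sketch using only the JDK. The class CharsetRoundTrip and its two helper methods are hypothetical names, not part of the question's code, and the sketch passes UTF-8 explicitly to getBytes where the question relies on the platform default. The point is that the charset matters only at the moment the raw bytes become a String; re-encoding an already-decoded String and decoding it again with the same charset returns the same String, damaged or not. With Apache HttpClient 4.x, the overload EntityUtils.toString(httpEntity, "UTF-8") supplies a default charset for that first decoding step when the response does not declare one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
public class CharsetRoundTrip {

    // Decoding raw bytes is the only step where the charset choice matters.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    // The pattern from the question, with the charset made explicit:
    // encode an already-decoded String and decode it again.
    static String reEncode(String alreadyDecoded) {
        byte[] bytes = alreadyDecoded.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Arabic title from the sample XML, written with Unicode escapes
        // so the source file's own encoding does not matter.
        String title = "\u0643\u0631\u064A\u0645";
        byte[] body = title.getBytes(StandardCharsets.UTF_8);

        String right = decode(body, StandardCharsets.UTF_8);
        String wrong = decode(body, StandardCharsets.ISO_8859_1);

        System.out.println("decoded correctly: " + right.equals(title));
        System.out.println("round trip repaired it: " + reEncode(wrong).equals(title));
    }
}
```

If the text is already garbled after the first decode, the fix belongs in that decode step, not in a later conversion.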

Also, read this http://kunststube.net/encoding/ for useful background on what's going on.

Upvotes: 1

Denys Séguret

Reputation: 382102

You've added a BOM to your UTF-8 file, which is bad.

Maybe you edited your file with Notepad, or maybe you should check your editor to ensure it doesn't add a BOM.

As the BOM seems to be inside the text and not at the start, you also need to remove it by using the delete key around its position (it's invisible in most editors). This may have happened during a file concatenation operation.
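If the file on the server cannot be fixed right away, a client-side workaround is to drop the BOM character (U+FEFF) from the decoded String before handing it to the parser. The following is a minimal sketch using only the JDK; BomStripper, stripBom and parse are hypothetical names, and the sketch assumes the String was decoded as UTF-8 so that the BOM really appears as the single character U+FEFF. It removes every occurrence, including ones inside the text, which is a blunt choice that fits the case described above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; names are illustrative only.
public class BomStripper {

    // The byte order mark as it appears after correct UTF-8 decoding.
    private static final String BOM = "\uFEFF";

    // Removes every U+FEFF, whether at the start or inside the text.
    static String stripBom(String xml) {
        return xml.replace(BOM, "");
    }

    // Strips BOM characters, then parses the remaining text into a DOM Document.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(stripBom(xml))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "\uFEFF<Entry><title>News Test 1</title></Entry>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Since the input is already a String, the parser reads characters directly and no encoding needs to be set on the InputSource.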

Upvotes: 2
