Reputation: 1226

Java Reading XML - Stops at '<' special character

I am making a practice application with the goal of reading data from an RSS feed.

So far it has gone well, except my application encounters an issue with special characters. It reads the first special character within the node, and then moves to the next node.

Any help would be much appreciated, and sorry for the large code blocks that follow.

RSS Feed - www.usu.co.nz/usu-news/rss.xml

<title>Unitec hosts American film students</title>
<link>http://www.usu.co.nz/node/4640</link>
<description>&lt;p&gt;If you’ve been hearing American accents around the Mt Albert campus over the past week.</description>

Display Code

String xml = XMLFunctions.getXML();
Document doc = XMLFunctions.XMLfromString(xml);

NodeList nodes = doc.getElementsByTagName("item");

for (int i = 0; i < nodes.getLength(); i++) 
{                           
    Element e = (Element)nodes.item(i);
    Log.v("XMLTest", XMLFunctions.getValue(e, "title"));
    Log.v("XMLTest", XMLFunctions.getValue(e, "link"));
    Log.v("XMLTest", XMLFunctions.getValue(e, "description"));  
    Log.v("XMLTest", XMLFunctions.getValue(e, "pubDate"));
    Log.v("XMLTest", XMLFunctions.getValue(e, "dc:creator"));
}

Reader Code

public class XMLFunctions 
{

public final static Document XMLfromString(String xml)
{

    Document doc = null;

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {

        DocumentBuilder db = dbf.newDocumentBuilder();

        InputSource is = new InputSource();
        is.setCharacterStream(new StringReader(xml));
        doc = db.parse(is); 

    } catch (ParserConfigurationException e) {
        System.out.println("XML parse error: " + e.getMessage());
        return null;
    } catch (SAXException e) {
        System.out.println("Wrong XML file structure: " + e.getMessage());
        return null;
    } catch (IOException e) {
        System.out.println("I/O exception: " + e.getMessage());
        return null;
    }

    return doc;

}

/** Returns element value
  * @param elem element (it is XML tag)
  * @return Element value otherwise empty String
  */
 public final static String getElementValue( Node elem ) {
     Node kid;
     if(elem != null)
     {
         if (elem.hasChildNodes())
         {
             for(kid = elem.getFirstChild(); kid != null; kid = kid.getNextSibling())
             {
                 if( kid.getNodeType() == Node.TEXT_NODE  )
                 {
                     return kid.getNodeValue();
                 }
             }
         }
     }
     return "";
 }

 public static String getXML(){  
        String line = null;

        try {

            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpPost httpPost = new HttpPost("http://www.usu.co.nz/usu-news/rss.xml");

            HttpResponse httpResponse = httpClient.execute(httpPost);
            HttpEntity httpEntity = httpResponse.getEntity();
            line = EntityUtils.toString(httpEntity);

        } catch (UnsupportedEncodingException e) {
            line = "<results status=\"error\"><msg>Can't connect to server</msg></results>";
        } catch (MalformedURLException e) {
            line = "<results status=\"error\"><msg>Can't connect to server</msg></results>";
        } catch (IOException e) {
            line = "<results status=\"error\"><msg>Can't connect to server</msg></results>";
        }

        return line;

}

public static int numResults(Document doc){     
    Node results = doc.getDocumentElement();
    int res = -1;

    try{
        res = Integer.valueOf(results.getAttributes().getNamedItem("count").getNodeValue());
    }catch(Exception e ){
        res = -1;
    }

    return res;
}

public static String getValue(Element item, String str) {       
    NodeList n = item.getElementsByTagName(str);        
    return XMLFunctions.getElementValue(n.item(0));
}
}

Output

Unitec hosts American film students
http://www.usu.co.nz/node/4640
<
Wed, 01 Aug 2012 05:43:22 +0000
Phillipa

Upvotes: 2

Answers (5)

Don Roby

Reputation: 41137

Your function

public final static String getElementValue( Node elem ) {
    Node kid;
    if(elem != null)
    {
        if (elem.hasChildNodes())
        {
            for(kid = elem.getFirstChild(); kid != null; kid = kid.getNextSibling())
            {
                if( kid.getNodeType() == Node.TEXT_NODE  )
                {
                    return kid.getNodeValue();
                }
            }
        }
    }
    return "";
}

returns only the first text node under the given element. The text inside a single element can be split across several text nodes, and this tends to happen around special characters such as the &lt; and &gt; entity references in your description.

You should probably concatenate all of the text nodes into one string and return that.

Something approximately like this might work:

public final static String getElementValue( Node elem ) {
    if ((elem == null) || (!(elem.hasChildNodes())))
        return "";

    Node kid;
    StringBuilder builder = new StringBuilder();
    for(kid = elem.getFirstChild(); kid != null; kid = kid.getNextSibling())
    {
        if( kid.getNodeType() == Node.TEXT_NODE  )
        {
            builder.append(kid.getNodeValue());
        }
    }
    return builder.toString();
}
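
If your DOM implementation supports DOM Level 3, another option is to let getTextContent() do the concatenation for you. The following is only a sketch: the class name TextContentSketch, the sampleDescription() helper and the shortened one-item XML string are made up for illustration and are not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextContentSketch {

    // Hypothetical replacement for getValue(): returns the full text of the
    // first <tag> descendant of item, or "" if there is no such element.
    public static String getValue(Element item, String tag) {
        NodeList n = item.getElementsByTagName(tag);
        if (n.getLength() == 0) {
            return "";
        }
        // getTextContent() concatenates every text node below the element,
        // so it does not matter how the parser split the text.
        return n.item(0).getTextContent();
    }

    // Parses a made-up one-item sample (assumption: modelled loosely on the
    // feed in the question) and returns its description text.
    public static String sampleDescription() {
        String xml = "<item><title>Unitec hosts American film students</title>"
                + "<description>&lt;p&gt;If you have been hearing American accents"
                + "</description></item>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return getValue(doc.getDocumentElement(), "description");
        } catch (Exception e) {
            throw new IllegalStateException("sample XML failed to parse", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleDescription());
    }
}
```

Unlike the loop above, getTextContent() also includes text from nested child elements, which may or may not be what you want for other tags.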

Upvotes: 2

pap

Reputation: 27614

Slightly off-topic, but you might want to check out one of the already existing RSS frameworks, like ROME. Better than re-inventing the wheel.

Upvotes: 1

Ian Roberts

Reputation: 122394

Your code only extracts the first child text node from the element. The DOM spec allows multiple adjacent text nodes, so I suspect what's happening here is that your parser is representing the <, p, > and the remaining text as (at least) four separate text nodes. You will either need to concatenate the nodes together into one string, or call normalize() on the containing element node (which modifies the DOM tree to merge adjacent text nodes into one).
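
To illustrate the normalize() route, here is a rough sketch that builds the split text nodes by hand instead of relying on any particular parser's behaviour. The class name NormalizeSketch, the firstText() helper and the sample text are assumptions made for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class NormalizeSketch {

    // Builds a <description> element whose text is deliberately split into
    // three adjacent text nodes, then returns the value of its first child.
    // When normalizeFirst is true, the element is normalized beforehand.
    public static String firstText(boolean normalizeFirst) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element description = doc.createElement("description");
            doc.appendChild(description);
            description.appendChild(doc.createTextNode("<"));
            description.appendChild(doc.createTextNode("p"));
            description.appendChild(doc.createTextNode(">If you have been hearing"));

            if (normalizeFirst) {
                // Merges the adjacent text nodes into a single text node.
                description.normalize();
            }
            return description.getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException("could not build sample DOM", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText(false)); // first fragment only
        System.out.println(firstText(true));  // merged text
    }
}
```

In your code the equivalent would be calling normalize() on each item element (or once on the document) before reading values with getElementValue.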

There are various libraries that can help you. For example, if your application uses the Spring framework then org.springframework.util.xml.DomUtils has a getTextValue static method that will extract the complete text value from an element.

Upvotes: 3

moody

Reputation: 33

Are you sure the XML string is not being altered by DefaultHttpClient? I tried your code, changing XMLFunctions.getXML() to return the XML string directly instead of fetching it with DefaultHttpClient, and the output was:

Unitec hosts American film students
http://www.usu.co.nz/node/4640
<p>If you’ve been hearing American accents around the Mt Albert campus over the past week.

as expected.

Upvotes: 0

Chris

Reputation: 5654

The XML declaration <?xml version="1.0" encoding="UTF-8"?> seems to be missing. Also, there is no root element.

Upvotes: 0
