Reputation: 143

How to parse within CDATA in XML using Java

I searched the existing CDATA discussions, but none of the ones I found achieve what I'm attempting.

Is it possible to parse within CDATA where the tag is not unique?

Below is the XML document. I'm trying to retrieve each field inside the CDATA block on line 5, which holds multiple fields of interest (i.e. Data Loaded, Quality, Status, Index). Each field is wrapped in an "li" tag within the CDATA block (even though the block is plain character data):

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Document>
 <name>area Area Date: 2014-07-31</name>
 <Placemark><name>P07L327</name><Point><coordinates>-96.26879,85.19125</coordinates></Point><description><![CDATA[<ol><li> Data Loaded:  NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>]]></description><Style> id = "colorIcon"</Style></Placemark>
 <coordinates>-96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,85.19125,0 -96.26879,45.14698,0 </coordinates>
</Document>
</kml>

Currently the output looks like this:

Name: <ol><li> Data Loaded:  NO</li><li>Quality: 5</li><li>Status: UP</li><li>Index: 72</li></eol>

From WITHIN the CDATA block, my intention is to output a new line for each field along with its value.

Below is the code written so far, which gives the current output listed above:

package com.lucy.seo;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.CharacterData;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import java.io.File;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Comment;
import org.w3c.dom.Text;
import org.xml.sax.SAXException;

public class ReadXMLFile {

    public static void main(String[] args) throws Exception {

        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/Oracle_Java_Project/Test_Doc.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);

        doc.getDocumentElement().normalize();

        System.out.println("Root element :" + doc.getDocumentElement().getNodeName());

        NodeList nList = doc.getElementsByTagName("Placemark");

        System.out.println("----------------------------");

        for (int temp = 0; temp < nList.getLength(); temp++) {
            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);
            System.out.println("Name: " + getCharacterDataFromElement(line));
        }
    }

    public static String getCharacterDataFromElement(Element f) {

        NodeList list = f.getChildNodes();
        String data;

        for (int index = 0; index < list.getLength(); index++) {
            if (list.item(index) instanceof CharacterData) {
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();

                if (data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}

Appreciate any help towards this! -- thanks!

Sep 2, 2014 update

Edited to add the final solution. Thank you to everyone here who posted solutions and helped. The solution is split into two pieces of code / files due to library conflicts:

// First file; its output is the input to the second file, which follows

import java.io.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ReadXMLFile {

    public static void main(String[] args) throws Exception {
        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html"));
        System.setOut(out);
        File fXmlFile = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/raw_input.xml");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(fXmlFile);

        //optional, but recommended
        //read this - http://stackoverflow.com/questions/13786607/normalization-in-dom-parsing-with-java-how-does-it-work
        doc.getDocumentElement().normalize();

        NodeList nList = doc.getElementsByTagName("Placemark");

        //create a buffered reader that connects to the console, we use it so we can read lines
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.println("<html xmlns=http://www.w3.org/1999/xhtml>");

        for (int temp = 0; temp < nList.getLength(); temp++) {
            Node nNode = nList.item(temp);
            Element eElement = (Element) nNode;

            Element element = (Element) nList.item(temp);
            NodeList name = element.getElementsByTagName("description");
            Element line = (Element) name.item(0);

            System.out.println("<bracket><li>Name: " + eElement.getElementsByTagName("name").item(0).getTextContent() + "</li>");
            System.out.println("<description>Description: " + getCharacterDataFromElement(line) + "</description></bracket>");
        }
        System.out.println("</html>");

        //read a line from the console
        String lineFromInput = in.readLine();

        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }

    public static String getCharacterDataFromElement(Element f) {

        NodeList list = f.getChildNodes();
        String data;

        for (int index = 0; index < list.getLength(); index++) {
            if (list.item(index) instanceof CharacterData) {
                CharacterData child = (CharacterData) list.item(index);
                data = child.getData();

                if (data != null && data.trim().length() > 0)
                    return child.getData();
            }
        }
        return "";
    }
}


// Second file
package ReadXMLFile_part2;

import java.io.*;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ReadXMLFile_part2 {

    public static void main(String[] args) throws Exception {

        PrintStream out = new PrintStream(new FileOutputStream("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/PA-PTH013_Output_Meters.xml"));
        System.setOut(out);

        System.out.println("*** JSOUP ***");

        File input = new File("C:/XML_UltraEdit/XML_Sandbox/NetBeans_Java_Project/temp_file.html");
        Document doc = null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "http://www.w3.org/1999/xhtml");
        } catch (IOException ex) {
            Logger.getLogger(ReadXMLFile_part2.class.getName()).log(Level.SEVERE, null, ex);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

        Elements brackets = doc.getElementsByTag("bracket");

        for (Element bracket : brackets) {
            Elements lis = bracket.select("li");

            for (Element li : lis) {
                System.out.println(li.text());
            }
            break;
        }
        System.out.println();

        //read a line from the console
        String lineFromInput = in.readLine();

        //output to the file a line
        out.println(lineFromInput);
        out.close();
    }

}

Upvotes: 1

Answers (2)

Michael Kay

Reputation: 163645

Your question is something of a contradiction, since CDATA is an explicit instruction to the parser NOT to parse what it sees inside the CDATA. So the simplest way to get the content parsed is not to include the CDATA tags in the first place.
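To illustrate the first option, here is a minimal sketch of what the lookup could look like if the description markup were ordinary XML rather than CDATA. The class name `NoCdataDemo`, the helper `listItems`, and the trimmed-down sample string are made up for this example, and the list is closed with `</ol>` (not the `</eol>` in the question's data) because markup outside CDATA has to be well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NoCdataDemo {

    // Hypothetical sample: the <ol> list is real XML here, not CDATA,
    // and it is closed with </ol> so the document is well-formed.
    static final String XML =
        "<Placemark><name>P07L327</name><description>"
        + "<ol><li> Data Loaded:  NO</li><li>Quality: 5</li>"
        + "<li>Status: UP</li><li>Index: 72</li></ol>"
        + "</description></Placemark>";

    /** Returns the trimmed text of every <li> element in the document. */
    static String[] listItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Without CDATA the <li> elements are ordinary DOM elements.
            NodeList items = doc.getElementsByTagName("li");
            String[] result = new String[items.getLength()];
            for (int i = 0; i < items.getLength(); i++) {
                result[i] = items.item(i).getTextContent().trim();
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Prints one field per line.
        for (String item : listItems(XML)) {
            System.out.println(item);
        }
    }
}
```

This only helps if you control how the file is produced; for a file that already arrives with CDATA, the approach in the next paragraph applies.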

However, having told the parser not to parse the CDATA content, what you can do is extract the content as text, and then submit the text to the parser as a second parse operation.

Upvotes: 0

GPI

Reputation: 9348

CDATA is a marker telling XML parsers that whatever they encounter between its start and end delimiters should be treated as "pure" (raw) character data.

So, in a way, it's like an escape character for the parser (one that can encompass many characters).

Therefore, you won't find an XML parser that reports whatever is inside a CDATA section as XML, because the specification says it MUST be reported as character data. (As a consequence, it MUST NOT be interpreted as an XML stream, which is actually good, because nothing requires the content to be XML at all.)

Anyway, your parser and your code are working as expected.
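A small sketch can make this visible. It assumes a default `DocumentBuilderFactory` (coalescing switched off, which is the documented default) and uses a made-up, trimmed-down Placemark; the class name `CdataIsTextDemo` and the helper `parseDescription` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class CdataIsTextDemo {

    // Trimmed-down, hypothetical version of the Placemark from the question.
    static final String XML =
        "<Placemark><description><![CDATA[<ol><li>Quality: 5</li></ol>]]>"
        + "</description></Placemark>";

    /** Parses the sample and returns its <description> element. */
    static Element parseDescription(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName("description").item(0);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element description = parseDescription(XML);

        // The whole CDATA block arrives as a single character-data node...
        System.out.println(description.getFirstChild() instanceof CDATASection);
        // ...so the DOM holds no <li> elements under <description>.
        System.out.println(description.getElementsByTagName("li").getLength());
        // The markup is still there, but only as text.
        System.out.println(description.getTextContent());
    }
}
```

That is why `getCharacterDataFromElement` in the question hands back the `<ol>...` markup as one string.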

But if, as in your case, you happen to know that the content of a certain CDATA section is indeed a valid XML instance, then you can open a new parser for this precise content and deal with it appropriately.

So you can take the output of your getCharacterDataFromElement(line) call, feed it to your DocumentBuilder, and use the resulting new Document instance to read the content of your li elements.
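A rough sketch of that two-step parse follows. The names `CdataSecondParse`, `parse`, and `fieldsOf` are invented for illustration, and `getTextContent()` stands in for your `getCharacterDataFromElement`. One important assumption: the sample closes the list with `</ol>`. The data in the question ends with `</eol>`, which is not well-formed, so a strict XML parser would reject it on the second pass; that content would need fixing at the source or a lenient HTML parser such as jsoup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CdataSecondParse {

    // Hypothetical sample based on the question's Placemark. The list is
    // closed with </ol> (the question's data has </eol>) so that the CDATA
    // content is well-formed XML and a strict parser accepts it.
    static final String KML =
        "<kml xmlns=\"http://earth.google.com/kml/2.0\"><Document><Placemark>"
        + "<name>P07L327</name><description><![CDATA[<ol>"
        + "<li> Data Loaded:  NO</li><li>Quality: 5</li>"
        + "<li>Status: UP</li><li>Index: 72</li></ol>]]></description>"
        + "</Placemark></Document></kml>";

    /** Parses a string as XML, wrapping checked parser exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    /** Returns the trimmed text of each <li> inside the first description. */
    static List<String> fieldsOf(String kml) {
        // First parse: the outer document; the CDATA content comes back as text.
        Document outer = parse(kml);
        String fragment =
            outer.getElementsByTagName("description").item(0).getTextContent();

        // Second parse: treat that text as its own small XML document.
        Document inner = parse(fragment);
        NodeList items = inner.getElementsByTagName("li");

        List<String> fields = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            fields.add(items.item(i).getTextContent().trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // One line per field, as asked for in the question.
        for (String field : fieldsOf(KML)) {
            System.out.println(field);
        }
    }
}
```

With several Placemarks you would loop over the description elements and run the second parse once per element.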

Upvotes: 3
