paul

Reputation: 13471

Parsing XML embedded in a CDATA section

I have an XML document with another XML document embedded inside a CDATA section. Any idea how to parse the embedded XML?

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<a>
    <b>
        <c>
            <![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?><bigXML>]]>
        </c>
    </b>
</a>

I cannot use regex/replace on the value of c, because the embedded XML is about 250 MB in size, and if I try any of those operations I get a Java heap OutOfMemoryError.

Upvotes: 0

Views: 649

Answers (1)

Eritrean

Reputation: 16498

You can try Jsoup. Jsoup is primarily an HTML parser, but it can also parse XML. It is quite intuitive, and once you are familiar with the selector syntax it is very easy to use. You can read the content of your CDATA section as a CDataNode, parse its text as a second document, and use the built-in methods to get what you need.

Maven dependency:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

I modified the very simplified XML from your question to get an example to play around with:

import org.jsoup.Jsoup;
import org.jsoup.nodes.CDataNode;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class TestJavaClass {

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<axx>\n"
                + "    <bxx>\n"
                + "        <cxx>\n"
                + "            <![CDATA["
                + "                     <?xml version=\"1.0\"?>\n"
                + "                         <catalog>\n"
                + "                             <book id=\"bk101\">\n"
                + "                                 <author>Gambardella, Matthew</author>\n"
                + "                                 <title>XML Developer's Guide</title>\n"
                + "                                 <genre>Computer</genre>\n"
                + "                                 <price>44.95</price>\n"
                + "                                 <publish_date>2000-10-01</publish_date>\n"
                + "                                 <description>An in-depth look at creating applications \n"
                + "                                 with XML.</description>\n"
                + "                             </book>\n"
                + "                             <book id=\"bk102\">\n"
                + "                                 <author>Ralls, Kim</author>\n"
                + "                                 <title>Midnight Rain</title>\n"
                + "                                 <genre>Fantasy</genre>\n"
                + "                                 <price>5.95</price>\n"
                + "                                 <publish_date>2000-12-16</publish_date>\n"
                + "                                 <description>A former architect battles corporate zombies, \n"
                + "                                 an evil sorceress, and her own childhood to become queen \n"
                + "                                 of the world.</description>\n"
                + "                             </book>"
                + "                         </catalog>"
                + "                 ]]>\n"
                + "        </cxx>\n"
                + "    </bxx>\n"
                + "</axx>\n";

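        // Parse the outer document with the XML parser instead of the default HTML parser
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        // In this sample, childNode(0) is the whitespace before the CDATA section and childNode(1) is the CDATA itself
        CDataNode cdata = (CDataNode) doc.selectFirst("cxx").childNode(1);

        // Parse the CDATA content as an XML document of its own
        Document cdataDoc = Jsoup.parse(cdata.text(), "", Parser.xmlParser());
        Elements authors = cdataDoc.select("book author");
        authors.forEach(aut -> System.out.println(aut.text()));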
    }
}

Output:

Gambardella, Matthew
Ralls, Kim

EDIT

Working with large files might end in an OutOfMemoryError. I haven't tried it yet, but according to this post Jsoup has a stream reader implemented for working with large files. If you are facing out-of-memory errors, try the overloaded parse method that accepts an InputStream:

public static org.jsoup.nodes.Document parse(@Nullable java.io.InputStream in,
                                         String charsetName,
                                         String baseUri,
                                         org.jsoup.parser.Parser parser)

So in the case above, something like:

InputStream in = new FileInputStream(new File("path to your xml file"));
Document doc = Jsoup.parse(in, "UTF-8", "", Parser.xmlParser());
....
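
If Jsoup still runs out of memory, since it builds the whole document tree, a different technique might be worth a try: the StAX API that ships with the JDK (javax.xml.stream). This is not Jsoup and I have not tested it against a 250 MB file, so treat it as a minimal sketch. The idea is to copy the text found inside the wrapper element to a Writer piece by piece, then parse that copy as its own XML stream. The class name CdataStreamCopy and the methods copyEmbeddedXml and readAuthors are made up for illustration, and whether a long CDATA section really arrives in several small pieces depends on the StAX implementation in use.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of Jsoup or of the original answer.
public class CdataStreamCopy {

    // Copies the text content of the first <elementName> element to target.
    public static void copyEmbeddedXml(Reader source, String elementName, Writer target)
            throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Coalescing off, so the parser is free to deliver long text in several events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(source);
        boolean inside = false;   // true while we are within the wrapper element
        boolean started = false;  // true once the first non-whitespace char was written
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    inside = true;
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    break;
                } else if (inside && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA)) {
                    char[] buf = reader.getTextCharacters();
                    int pos = reader.getTextStart();
                    int end = pos + reader.getTextLength();
                    // Skip leading whitespace: an XML declaration must be the very
                    // first thing in the extracted document.
                    while (!started && pos < end && Character.isWhitespace(buf[pos])) {
                        pos++;
                    }
                    if (pos < end) {
                        started = true;
                        target.write(buf, pos, end - pos);
                    }
                }
            }
        } finally {
            reader.close();
        }
    }

    // Streams over the extracted document and collects the text of every <author>.
    public static List<String> readAuthors(Reader embedded) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(embedded);
        List<String> authors = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("author")) {
                    authors.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return authors;
    }

    public static void main(String[] args) throws Exception {
        String outer = "<a><b><c>\n<![CDATA[<?xml version=\"1.0\"?><catalog>"
                + "<book><author>Gambardella, Matthew</author></book>"
                + "<book><author>Ralls, Kim</author></book>"
                + "</catalog>]]>\n</c></b></a>";
        // For a real 250 MB file, use a file-backed Reader and Writer (for example a
        // temp file) here instead of in-memory strings.
        StringWriter extracted = new StringWriter();
        copyEmbeddedXml(new StringReader(outer), "c", extracted);
        readAuthors(new StringReader(extracted.toString())).forEach(System.out::println);
    }
}
```

The copy step never needs the embedded XML as a single String, which is where regex/replace tends to blow up the heap. The second pass also only looks at one element at a time.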

Upvotes: 1
