makingAMess

Reputation: 33

How to parse inline/mixed content XML elements in Java with DOM

I am playing around with XML parsing and have learnt a bit from various resources. I'm a beginner in the world of Java and I'm still trying to get my head around things.

Currently I am stuck trying to parse something looking like this:

<poem>
    <line>Hey diddle, diddle 
        <i>the cat</i> and the fiddle.
    </line>
</poem>

That's not the actual XML; the real one looks a lot worse, so I posted this instead (same idea, I guess).

I'm trying to get output something like this:

Element : line
    text : Hey diddle, diddle
    element: i
        text: the cat
    text: and the fiddle.
------------------------ 
OR
------------------------ 

line:   Hey diddle, diddle
    i: the cat
    and the fiddle

My code at the moment looks like this:

public class parsingWithDOM {

    public static void main(String[] args) {
        File xml = new File("/Users.../xmlTest.xml");
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(xml);

            NodeList nList = doc.getElementsByTagName("line");
            Node l = nList.item(0);
            if (l.getNodeType() == Node.ELEMENT_NODE) {
                Element line = (Element) l;
                System.out.println(line.getTagName()  + ": " + line.getTextContent());
                NodeList lineList = line.getChildNodes();
                for (int i = 0; i < lineList.getLength(); i++) {
                    Node node = lineList.item(i);
                    if (node.getNodeType() == Node.ELEMENT_NODE) {
                        Element lineElement = (Element) node;
                        System.out.println(lineElement.getTagName() + ": " + lineElement.getTextContent());
                    }
                }
            }

        } catch (IOException | ParserConfigurationException | DOMException | SAXException e) {
            System.out.println(e.getMessage());
        }

    }
}

Anyway, the output I'm getting is this (not quite what I'm looking for):

line: Hey diddle, diddle the cat and the fiddle.

i: the cat

Any help would be very much appreciated 😊

Upvotes: 1

Views: 1050

Answers (3)

Michael Kay

Reputation: 163468

There are many tasks that are much easier done in XSLT than in Java/DOM, and this is one of them. Here's a solution using XSLT 3.0.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://local/"
    exclude-result-prefixes="#all"
    expand-text="yes"
    version="3.0">

  <xsl:output method="text" />
  <xsl:strip-space elements="*"/>

  <xsl:template match="*">
    <xsl:text>{f:indent(.)}ELEMENT {name()}</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:text>{f:indent(.)}text: {.}</xsl:text>
  </xsl:template>

  <xsl:function name="f:indent" as="xs:string">
      <xsl:param name="node" as="node()"/>
      <xsl:sequence select="'&#xa;' || string-join((1 to count($node/ancestor::*))!'__')"/>
  </xsl:function>

</xsl:stylesheet>

The output is

ELEMENT poem
__ELEMENT line
____text: Hey diddle, diddle 

____ELEMENT i
______text: the cat
____text: and the fiddle.

and you can see it in action at

https://xsltfiddle.liberty-development.net/gWEaSuR/1

To talk you through it:

  • xsl:output says you want text output, rather than XML or HTML

  • xsl:strip-space says ignore whitespace-only text nodes in the input

  • There are two xsl:template rules, one for elements and one for text nodes

  • Both of these invoke a function f:indent which generates indentation according to the depth of the node in the tree (found by counting ancestors)
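For readers coming from the DOM side, the same "depth equals number of ancestors" idea can be sketched in Java. This is only an illustration: the class name AncestorDepth and its parse helper are made up for this example and are not part of the stylesheet.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AncestorDepth {

    // Counts the element ancestors of a node. The Document node is skipped,
    // which mirrors count($node/ancestor::*) in the stylesheet.
    static int depth(Node node) {
        int depth = 0;
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE) {
                depth++;
            }
        }
        return depth;
    }

    // Hypothetical convenience helper: parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this, the root element has depth 0, a child of the root has depth 1, and so on, so the indent is simply the depth multiplied by the indent unit.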

Most of the work in this stylesheet is getting the output formatting right (the input navigation takes care of itself). I used underscores rather than spaces in the output so you can see the difference between whitespace that comes from the input, and whitespace generated by the stylesheet.

The JDK has a built-in XSLT 1.0 processor, but XSLT 3.0 has many extra features, and for that you'll want to install Saxon. Both processors can readily be invoked from Java applications.
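As a rough sketch of what invoking a processor from Java can look like, the example below uses the standard javax.xml.transform (JAXP) API. Two assumptions to note: the embedded stylesheet is an XSLT 1.0 rewrite of the same idea (named template instead of f:indent, no text value templates), so that it can run on the JDK's built-in processor, and it normalizes whitespace in text nodes, which the 3.0 stylesheet above does not. The class name XsltFromJava is made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFromJava {

    // XSLT 1.0 approximation of the stylesheet (an assumption for this sketch,
    // not the XSLT 3.0 code from the answer).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:strip-space elements='*'/>"
      + "<xsl:template match='*'>"
      + "<xsl:call-template name='indent'/>"
      + "<xsl:text>ELEMENT </xsl:text><xsl:value-of select='name()'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "<xsl:apply-templates/>"
      + "</xsl:template>"
      + "<xsl:template match='text()'>"
      + "<xsl:call-template name='indent'/>"
      + "<xsl:text>text: </xsl:text><xsl:value-of select='normalize-space(.)'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:template>"
      // One "__" per element ancestor of the current node.
      + "<xsl:template name='indent'>"
      + "<xsl:for-each select='ancestor::*'><xsl:text>__</xsl:text></xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the text result.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<poem>\n    <line>Hey diddle, diddle \n"
                   + "        <i>the cat</i> and the fiddle.\n    </line>\n</poem>";
        System.out.print(transform(xml));
    }
}
```

TransformerFactory.newInstance() picks up whichever JAXP implementation is available, so the same calling code should also work with another processor on the classpath; running an XSLT 3.0 stylesheet would additionally require a processor that supports it.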

Upvotes: 2

suhas_partha

Reputation: 61

The code below should produce the output you are after:

import java.io.File;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class ParsingWithDOM {

    public static void main(String[] args) {
        File xml = new File("sample.xml");
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(xml);

            StringBuilder sb_inner = new StringBuilder();

            NodeList nList = doc.getElementsByTagName("line");
            Node l = nList.item(0);
            if (l.getNodeType() == Node.ELEMENT_NODE) {
                Element line = (Element) l;

                String outer = line.getTagName()  + ": " + line.getTextContent();

                NodeList lineList = line.getChildNodes();
                for (int i = 0; i < lineList.getLength(); i++) {
                    Node node = lineList.item(i);
                    if (node.getNodeType() == Node.ELEMENT_NODE) {
                        Element lineElement = (Element) node;
                        sb_inner.append(lineElement.getTagName() + ": " + lineElement.getTextContent()).append("\n");
                    }
                }

                String sub = sb_inner.toString();
                String []formatter = sub.split("\n");
                for(int i=0; i< formatter.length; i++) {
                    outer = outer.replace(formatter[i].split(":")[1].trim(), 
                    formatter[i]+"\n");
                }


                System.out.println(outer);

            }

        } catch (IOException | ParserConfigurationException | DOMException | SAXException e) {
            System.out.println(e.getMessage());
        }

    }
}

Upvotes: -1

Andreas

Reputation: 159145

You can do it like this, using the getFirstChild(), getNextSibling(), and getParentNode() methods to navigate the DOM tree:

int level = 0;
Node node = doc.getDocumentElement();
while (node != null) {
    // Process node
    if (node.getNodeType() == Node.ELEMENT_NODE) {
        System.out.println("  ".repeat(level) + "Element: \"" + node.getNodeName() + "\"");
    } else if (node.getNodeType() == Node.TEXT_NODE || node.getNodeType() == Node.CDATA_SECTION_NODE) {
        String text = node.getNodeValue()
                .replace("\r", "\\r")
                .replace("\n", "\\n")
                .replace("\t", "\\t");
        System.out.println("  ".repeat(level) + "Text: \"" + text + "\"");
    }

    // Advance to next node
    if (node.getFirstChild() != null) {
        node = node.getFirstChild();
        level++;
    } else {
        while (node.getNextSibling() == null && node.getParentNode() != null) {
            node = node.getParentNode();
            level--;
        }
        node = node.getNextSibling();
    }
}

The code uses the Java 11+ String.repeat(int count) method to indent the text. On earlier versions of Java, build the indent string some other way, for example with a loop.
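One possible replacement for repeat on older Java versions is a small loop-based helper. The class and method names here (IndentHelper, indent) are made up for illustration:

```java
class IndentHelper {

    // Returns `unit` repeated `level` times. A level of zero or less yields
    // an empty string (unlike String.repeat, which rejects negative counts).
    static String indent(String unit, int level) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < level; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }
}
```

In the loop above, "  ".repeat(level) would then become IndentHelper.indent("  ", level).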

Output

Element: "poem"
  Text: "\n    "
  Element: "line"
    Text: "Hey diddle, diddle \n        "
    Element: "i"
      Text: "the cat"
    Text: " and the fiddle.\n    "
  Text: "\n"
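The output above keeps every text node, including whitespace-only ones. If the goal is closer to the format in the question, a recursive variant that trims text and skips empty text nodes might look like the sketch below. The class name MixedContentPrinter, the four-space indent, and the "element:"/"text:" labels are assumptions chosen to resemble the desired output. Note that trimming discards leading and trailing whitespace, which may matter for real mixed content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class MixedContentPrinter {

    // Depth-first walk: prints each element, then recurses into its children
    // in document order so text and inline elements stay interleaved.
    static void walk(Node node, int level, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.ELEMENT_NODE) {
            appendLine(out, level, "element: " + node.getNodeName());
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                walk(c, level + 1, out);
            }
        } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {          // skip whitespace-only text nodes
                appendLine(out, level, "text: " + text);
            }
        }
    }

    // Appends one indented line (four spaces per level).
    static void appendLine(StringBuilder out, int level, String s) {
        for (int i = 0; i < level; i++) {
            out.append("    ");
        }
        out.append(s).append('\n');
    }

    // Parses the XML string and renders the tree starting at the root element.
    static String render(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render("<poem>\n    <line>Hey diddle, diddle \n"
                + "        <i>the cat</i> and the fiddle.\n    </line>\n</poem>"));
    }
}
```

The key difference from calling getTextContent() on the line element is that each child node is handled separately, so the text before and after the i element is reported as separate entries rather than one concatenated string.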

Upvotes: 1
