Reputation: 33
I am playing around with XML parsing and have learnt a bit from various resources. I'm a beginner in the world of Java and I'm still trying to get my head around things.
Currently I am stuck trying to parse something looking like this:
<poem>
<line>Hey diddle, diddle
<i>the cat</i> and the fiddle.
</line>
</poem>
That's not the actual XML, but the real one looks a lot worse, so I posted this instead (same idea, I guess).
I'm trying to get output something like this:
Element : line
text : Hey diddle, diddle
element: i
text: the cat
text: and the fiddle.
------------------------
OR
------------------------
line: Hey diddle, diddle
i: the cat
and the fiddle
My code at the moment looks like this:
public class parsingWithDOM {
    public static void main(String[] args) {
        File xml = new File("/Users.../xmlTest.xml");
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(xml);
            NodeList nList = doc.getElementsByTagName("line");
            Node l = nList.item(0);
            if (l.getNodeType() == Node.ELEMENT_NODE) {
                Element line = (Element) l;
                System.out.println(line.getTagName() + ": " + line.getTextContent());
                NodeList lineList = line.getChildNodes();
                for (int i = 0; i < lineList.getLength(); i++) {
                    Node node = lineList.item(i);
                    if (node.getNodeType() == Node.ELEMENT_NODE) {
                        Element lineElement = (Element) node;
                        System.out.println(lineElement.getTagName() + ": " + lineElement.getTextContent());
                    }
                }
            }
        } catch (IOException | ParserConfigurationException | DOMException | SAXException e) {
            System.out.println(e.getMessage());
        }
    }
}
Anyway, the output I'm getting is this (not quite what I'm looking for):
line: Hey diddle, diddle the cat and the fiddle.
i: the cat
Any help would be very much appreciated 😊
Upvotes: 1
Views: 1050
Reputation: 163468
There are many tasks that are much easier done in XSLT than in Java/DOM, and this is one of them. Here's a solution using XSLT 3.0.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://local/"
    exclude-result-prefixes="#all"
    expand-text="yes"
    version="3.0">
  <xsl:output method="text" />
  <xsl:strip-space elements="*"/>
  <xsl:template match="*">
    <xsl:text>{f:indent(.)}ELEMENT {name()}</xsl:text>
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:text>{f:indent(.)}{.}</xsl:text>
  </xsl:template>
  <xsl:function name="f:indent" as="xs:string">
    <xsl:param name="node" as="node()"/>
    <xsl:sequence select="'&#10;' || string-join((1 to count($node/ancestor::*))!'__')"/>
  </xsl:function>
</xsl:stylesheet>
The output is
ELEMENT poem
__ELEMENT line
____text: Hey diddle, diddle
____ELEMENT i
______text: the cat
____text: and the fiddle.
and you can see it in action at
https://xsltfiddle.liberty-development.net/gWEaSuR/1
To talk you through it:
xsl:output says you want text output, rather than XML or HTML.
xsl:strip-space says to ignore whitespace-only text nodes in the input.
There are two xsl:template rules, one for elements and one for text nodes.
Both of these invoke a function, f:indent, which generates indentation according to the depth of the node in the tree (found by counting ancestors).
Most of the work in this stylesheet is getting the output formatting right (the input navigation takes care of itself). I used underscores rather than spaces in the output so you can see the difference between whitespace that comes from the input, and whitespace generated by the stylesheet.
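For readers more at home in Java than XSLT, here is a minimal sketch of the same depth-by-counting-ancestors idea applied to a DOM node. The class and method names (IndentByDepth, depth, indent, parse) are made up for illustration and are not part of the stylesheet or of any library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class IndentByDepth {

    // Counts element ancestors only, mirroring count($node/ancestor::*).
    // The Document node is a parent of the root element but is not counted.
    static int depth(Node node) {
        int depth = 0;
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.getNodeType() == Node.ELEMENT_NODE) {
                depth++;
            }
        }
        return depth;
    }

    // Two underscores per level, as in the stylesheet's f:indent (without the leading newline).
    static String indent(Node node) {
        StringBuilder sb = new StringBuilder();
        int levels = depth(node);
        for (int i = 0; i < levels; i++) {
            sb.append("__");
        }
        return sb.toString();
    }

    // Hypothetical helper: parses an XML string, rethrowing checked exceptions unchecked.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<poem><line>Hey <i>the cat</i></line></poem>");
        Node i = doc.getElementsByTagName("i").item(0);
        System.out.println(indent(i) + "ELEMENT " + i.getNodeName());
        System.out.println(indent(i.getFirstChild()) + i.getFirstChild().getNodeValue());
    }
}
```

Computing the depth on demand like this walks up the tree for every node, which is fine for small documents; a traversal that carries a level counter along avoids the repeated walk.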
The JDK has a built-in XSLT 1.0 processor, but XSLT 3.0 has many extra features, and for that you'll want to install Saxon. Both processors can readily be invoked from Java applications.
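As a rough illustration of invoking a stylesheet from Java, the sketch below uses the standard javax.xml.transform (JAXP) API with whatever processor the JDK provides. Because that built-in processor only handles XSLT 1.0, the embedded stylesheet is my own approximate 1.0 port of the one above (a named template instead of f:indent, no text value templates); it is an assumption for illustration, not the answer's stylesheet. The class name XsltFromJava and the transform helper are likewise hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltFromJava {

    // Approximate XSLT 1.0 port of the 3.0 stylesheet (an assumption, not the original).
    static final String STYLESHEET =
          "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:template match='*'>"
        +   "<xsl:call-template name='indent'/>"
        +   "<xsl:text>ELEMENT </xsl:text><xsl:value-of select='name()'/>"
        +   "<xsl:apply-templates/>"
        + "</xsl:template>"
        + "<xsl:template match='text()'>"
        +   "<xsl:call-template name='indent'/>"
        +   "<xsl:value-of select='.'/>"
        + "</xsl:template>"
        // Newline, then two underscores per element ancestor of the current node.
        + "<xsl:template name='indent'>"
        +   "<xsl:text>&#10;</xsl:text>"
        +   "<xsl:for-each select='ancestor::*'>__</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the given stylesheet to the given XML and returns the result as a string.
    static String transform(String xslt, String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<poem><line>Hey diddle, diddle <i>the cat</i> and the fiddle.</line></poem>";
        System.out.println(transform(STYLESHEET, xml));
    }
}
```

The same JAXP calls work with other processors that implement the API; running the XSLT 3.0 stylesheet itself would need an XSLT 3.0 processor such as Saxon on the classpath, which this sketch does not set up.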
Upvotes: 2
Reputation: 61
The code below should meet your requirement:
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class ParsingWithDOM {
    public static void main(String[] args) {
        File xml = new File("sample.xml");
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(xml);
            StringBuilder sb_inner = new StringBuilder();
            NodeList nList = doc.getElementsByTagName("line");
            Node l = nList.item(0);
            if (l.getNodeType() == Node.ELEMENT_NODE) {
                Element line = (Element) l;
                String outer = line.getTagName() + ": " + line.getTextContent();
                NodeList lineList = line.getChildNodes();
                for (int i = 0; i < lineList.getLength(); i++) {
                    Node node = lineList.item(i);
                    if (node.getNodeType() == Node.ELEMENT_NODE) {
                        Element lineElement = (Element) node;
                        sb_inner.append(lineElement.getTagName() + ": " + lineElement.getTextContent()).append("\n");
                    }
                }
                String sub = sb_inner.toString();
                String[] formatter = sub.split("\n");
                for (int i = 0; i < formatter.length; i++) {
                    outer = outer.replace(formatter[i].split(":")[1].trim(),
                            formatter[i] + "\n");
                }
                System.out.println(outer);
            }
        } catch (IOException | ParserConfigurationException | DOMException | SAXException e) {
            System.out.println(e.getMessage());
        }
    }
}
Upvotes: -1
Reputation: 159145
You can do it like this, using the getFirstChild(), getNextSibling(), and getParentNode() methods to navigate the DOM tree:
int level = 0;
Node node = doc.getDocumentElement();
while (node != null) {
    // Process node
    if (node.getNodeType() == Node.ELEMENT_NODE) {
        System.out.println(" ".repeat(level) + "Element: \"" + node.getNodeName() + "\"");
    } else if (node.getNodeType() == Node.TEXT_NODE || node.getNodeType() == Node.CDATA_SECTION_NODE) {
        String text = node.getNodeValue()
                .replace("\r", "\\r")
                .replace("\n", "\\n")
                .replace("\t", "\\t");
        System.out.println(" ".repeat(level) + "Text: \"" + text + "\"");
    }
    // Advance to next node
    if (node.getFirstChild() != null) {
        node = node.getFirstChild();
        level++;
    } else {
        while (node.getNextSibling() == null && node.getParentNode() != null) {
            node = node.getParentNode();
            level--;
        }
        node = node.getNextSibling();
    }
}
The code uses the Java 11+ String.repeat(int count) method for indenting the text. For earlier versions of Java, use some other mechanism for that.
Output
Element: "poem"
 Text: "\n "
 Element: "line"
  Text: "Hey diddle, diddle \n "
  Element: "i"
   Text: "the cat"
  Text: " and the fiddle.\n "
 Text: "\n"
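The output above still contains the whitespace-only text nodes and the line breaks inside the text. If the goal is something closer to the format asked for in the question, one possible variation is sketched below: a recursive walk that trims each text node and skips it when nothing is left. It also builds the indentation with a plain loop, so it does not need Java 11. The class name DomOutline and its methods are hypothetical, and trimming is my assumption about what the asker wants, not part of the code above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomOutline {

    // Parses the XML string and returns an indented outline of elements and text.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Visits the node, then each of its children one level deeper.
    static void walk(Node node, int level, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.ELEMENT_NODE) {
            indent(out, level).append("Element: ").append(node.getNodeName()).append('\n');
        } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // Assumption: leading/trailing whitespace is not significant for this output.
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                indent(out, level).append("Text: ").append(text).append('\n');
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, level + 1, out);
        }
    }

    // Two spaces per level, built with a loop so it also works before Java 11.
    static StringBuilder indent(StringBuilder out, int level) {
        for (int i = 0; i < level; i++) {
            out.append("  ");
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.print(outline(
                "<poem>\n<line>Hey diddle, diddle\n<i>the cat</i> and the fiddle.\n</line>\n</poem>"));
    }
}
```

Recursion keeps the level bookkeeping implicit, at the cost of stack depth on very deeply nested documents; the iterative loop above avoids that.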
Upvotes: 1