Reputation: 214
My problem is that I have to work with an XML file that one of my company's suppliers sent me.
That would be fine if the XML were well constructed, but it is not at all.
<catalog>
<product>
<ref>4780</ref>
.
.
.
<arrivals>
<product>
<image title="AMARILLO">AMA</image>
<size>S/T </size>
</product>
<product>
<image title="AZUL">AZUL</image>
<size>S/T </size>
</product>
</arrivals>
</product>
</catalog>
As you can see, the outer <product> tag holds all the information about the product, but there are more tags named <product> nested inside it to distinguish the different colors.
This is the code I use to walk through the XML:
doc = db.parse("filename.xml");
Element esproducte = (Element) doc.getElementsByTagName("product").item(0);
NodeList nArrv = esproducte.getElementsByTagName("arrivals");
Element eArrv = (Element) nArrv.item(0);
NodeList eProds = eArrv.getElementsByTagName("product");//THIS THING
for (int l = 0; l < eProds.getLength(); l++)
{
    Node ln = eProds.item(l);
    if (ln.getNodeType() == Node.ELEMENT_NODE)
    {
        Element le = (Element) ln;
        //COLORS / IMAGES / CONFIGS
        NodeList nimgcol = le.getElementsByTagName("image");
        Element eimgcol = (Element) nimgcol.item(0);
        System.out.println("Name of the color " + eimgcol.getTextContent());
    }
}
What happens is that the print is repeated more times than it should be, and I think it's because of the parent <product>. I thought it shouldn't happen because, where I wrote //THIS THING, I take into account the fact that <product> is inside <arrivals>. But it is not working.
What should I modify in the code so that the loop runs only 2 times and not 3, which is what happens in this case?
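Solution: change
NodeList eProds = eArrv.getElementsByTagName("product");//THIS THING
to
NodeList eProds = eArrv.getChildNodes();//THIS THING
The rest stays exactly the same, and it works perfectly. getChildNodes() returns only the direct children of <arrivals>; the whitespace text nodes it also returns are skipped by the existing ELEMENT_NODE check.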
Upvotes: 0
Views: 2041
Reputation: 198
As Andreas mentioned, there is nothing invalid about the document. The problem is the use of getElementsByTagName, which collects every element with that tag name anywhere below the node it is called on (the whole document, when called on the Document), regardless of structure.
You can use XPath to select specific elements by their path instead.
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import java.io.IOException;
import java.io.StringReader;
public class XMLParsing {
    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException, XPathExpressionException {
        String xml = "<catalog>\n" +
                " <product>\n" +
                " <ref>4780</ref>\n" +
                " .\n" +
                " .\n" +
                " .\n" +
                " <arrivals>\n" +
                " <product>\n" +
                " <image title=\"AMARILLO\">AMA</image>\n" +
                " <size>S/T </size>\n" +
                " </product>\n" +
                " <product>\n" +
                " <image title=\"AZUL\">AZUL</image>\n" +
                " <size>S/T </size>\n" +
                " </product>\n" +
                " </arrivals>\n" +
                " </product>\n" +
                "</catalog>\n";
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));
        XPathFactory xPathFactory = XPathFactory.newInstance();
        XPath xPath = xPathFactory.newXPath();
        // get all products under "arrivals"
        XPathExpression expression = xPath.compile("/catalog/product/arrivals//product");
        NodeList nodes = (NodeList) expression.evaluate(document, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            Node product = nodes.item(i);
            NodeList productChildren = product.getChildNodes();
            for (int j = 0; j < productChildren.getLength(); j++) {
                Node item = productChildren.item(j);
                if (item instanceof Element) {
                    Element element = (Element) item;
                    switch (element.getTagName()) {
                        case "image":
                            System.out.println("product image title : " + element.getAttribute("title"));
                            break;
                        case "size":
                            System.out.println("product size : " + element.getTextContent());
                            break;
                        default:
                            break;
                    }
                }
            }
        }
    }
}
Upvotes: 1
Reputation: 3736
getElementsByTagName gives you all tags named "product" that are inside that tag, including the "product" tags for the colors.
Try using getChildNodes and check the names of the nodes instead.
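A minimal sketch of that suggestion, assuming the sample from the question with its elided "." lines left out. The class name ArrivalsChildren and the method colorCodes are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original question.
public class ArrivalsChildren {
    // The question's sample, without the elided "." lines.
    static final String SAMPLE =
            "<catalog>\n"
            + "  <product>\n"
            + "    <ref>4780</ref>\n"
            + "    <arrivals>\n"
            + "      <product><image title=\"AMARILLO\">AMA</image><size>S/T </size></product>\n"
            + "      <product><image title=\"AZUL\">AZUL</image><size>S/T </size></product>\n"
            + "    </arrivals>\n"
            + "  </product>\n"
            + "</catalog>\n";

    // Returns the text of the first <image> in each <product> that is a direct child of <arrivals>.
    static List<String> colorCodes(String xml) {
        List<String> codes = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element arrivals = (Element) doc.getElementsByTagName("arrivals").item(0);
            NodeList children = arrivals.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // getChildNodes() also returns whitespace text nodes, so check type and name.
                if (child.getNodeType() == Node.ELEMENT_NODE && "product".equals(child.getNodeName())) {
                    Node image = ((Element) child).getElementsByTagName("image").item(0);
                    if (image != null) {
                        codes.add(image.getTextContent());
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
        return codes;
    }

    public static void main(String[] args) {
        for (String code : colorCodes(SAMPLE)) {
            System.out.println("Name of the color " + code);
        }
    }
}
```

Checking the node name as well as the node type keeps the loop correct if <arrivals> later gains other kinds of child elements.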
Upvotes: 1
Reputation: 159114
It is perfectly valid to have tags inside different parent elements that are named the same, but have different content/meaning, as is the case in your example.
An element whose path is /catalog/product is entirely different from an element whose path is /catalog/product/arrivals/product. As an example, both XPath and XML Schema will consider them distinct.
It is only lazily written code that cannot tell them apart, e.g. by using getElementsByTagName, which locates elements anywhere ("all descendants") regardless of their location (path).
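To make the difference concrete, here is a small sketch that counts the product elements in the question's sample three ways. The class name ProductCounts and its helper methods are made up for this example, and the elided "." lines of the sample are left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original question.
public class ProductCounts {
    // The question's sample, without the elided "." lines.
    static final String SAMPLE =
            "<catalog>\n"
            + "  <product>\n"
            + "    <ref>4780</ref>\n"
            + "    <arrivals>\n"
            + "      <product><image title=\"AMARILLO\">AMA</image><size>S/T </size></product>\n"
            + "      <product><image title=\"AZUL\">AZUL</image><size>S/T </size></product>\n"
            + "    </arrivals>\n"
            + "  </product>\n"
            + "</catalog>\n";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    // Counts the nodes selected by an XPath expression, i.e. by location in the tree.
    static int countByPath(Document doc, String path) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(path, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("bad XPath: " + path, e);
        }
    }

    // Counts by tag name only, ignoring where each element sits in the tree.
    static int countByTagName(Document doc, String name) {
        return doc.getElementsByTagName(name).getLength();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(countByTagName(doc, "product"));                        // 3
        System.out.println(countByPath(doc, "/catalog/product"));                  // 1
        System.out.println(countByPath(doc, "/catalog/product/arrivals/product")); // 2
    }
}
```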
When processing the DOM tree, do it in a structured fashion:
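1. Get the root element (catalog).
2. Iterate the child elements of catalog and process each one named product.
3. When processing a product inside catalog: iterate the child elements of that product element and process the ones you need, e.g. ref, arrivals.
4. When processing arrivals: iterate the child elements of the arrivals element and process each one named product.
5. When processing a product inside arrivals: iterate its child elements and process the ones you need, e.g. image, size.
As you can see, the place in your code that handles an element named product inside an element named catalog is different from the code that handles an element named product inside an element named arrivals.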
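The steps above can be sketched roughly as follows, with one handler per location in the tree. All names here (StructuredWalk, children, the handle... methods) are invented for the illustration, and the sample omits the elided "." lines from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original question.
public class StructuredWalk {
    static final String SAMPLE =
            "<catalog>\n  <product>\n    <ref>4780</ref>\n    <arrivals>\n"
            + "      <product><image title=\"AMARILLO\">AMA</image><size>S/T </size></product>\n"
            + "      <product><image title=\"AZUL\">AZUL</image><size>S/T </size></product>\n"
            + "    </arrivals>\n  </product>\n</catalog>\n";

    // Direct child elements only; text and comment nodes are skipped.
    static List<Element> children(Element parent) {
        List<Element> result = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (nodes.item(i).getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) nodes.item(i));
            }
        }
        return result;
    }

    // Walks the tree from the root and returns what each handler recorded, in document order.
    static List<String> walk(String xml) {
        List<String> log = new ArrayList<>();
        Element catalog;
        try {
            catalog = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
        for (Element child : children(catalog)) {
            if ("product".equals(child.getTagName())) {
                handleCatalogProduct(child, log);
            }
        }
        return log;
    }

    // Handles /catalog/product.
    static void handleCatalogProduct(Element product, List<String> log) {
        for (Element child : children(product)) {
            switch (child.getTagName()) {
                case "ref":
                    log.add("ref=" + child.getTextContent());
                    break;
                case "arrivals":
                    handleArrivals(child, log);
                    break;
                default:
                    break;
            }
        }
    }

    // Handles /catalog/product/arrivals.
    static void handleArrivals(Element arrivals, List<String> log) {
        for (Element child : children(arrivals)) {
            if ("product".equals(child.getTagName())) {
                handleArrivalProduct(child, log);
            }
        }
    }

    // Handles /catalog/product/arrivals/product, separately from the outer product.
    static void handleArrivalProduct(Element product, List<String> log) {
        for (Element child : children(product)) {
            switch (child.getTagName()) {
                case "image":
                    log.add("image=" + child.getAttribute("title"));
                    break;
                case "size":
                    log.add("size=" + child.getTextContent().trim());
                    break;
                default:
                    break;
            }
        }
    }

    public static void main(String[] args) {
        walk(SAMPLE).forEach(System.out::println);
    }
}
```

Because each handler only looks at direct children, the outer product and the nested color products can never be confused, whatever names they share.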
Upvotes: 1