Jack Sierkstra

Reputation: 1434

Retrieve XML node by line number and line position in Java

We receive XML files that are valid according to a specification. An external party checks each original XML file and generates warnings based on its contents. If there are warnings, this results in two files:

The problem is that each warning refers to its location in the original file by line number and line position:

  <PositionInBericht>
    <LineNumber>78</LineNumber>
    <LinePosition>10</LinePosition>
  </PositionInBericht>

Unfortunately we cannot change this, because the specification requires this behavior. I searched the web for examples, but found very little that does what I want.

Resources I found were:

How should I use line number and column number to get element in XML in JAVA

Java / Groovy : Find XML node by Line number

The solutions provided in those posts are suboptimal or absent. I would like to know whether anyone has done this before and come up with a good solution.

Edit:

To help others: I found a solution. Given a line number, it prints information about each start element that the parser reports on that line.

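import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.net.URISyntaxException;
import java.net.URL;
import javax.xml.namespace.QName;
import javax.xml.stream.Location;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ParsingByLineNumberApplication {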

/**
 * URLs used as inspiration for this project.
 *
 * How should I use line number and column number to get element in XML in JAVA
 * https://stackoverflow.com/questions/41225724/how-should-i-use-line-number-and-column-number-to-get-element-in-xml-in-java
 *
 * Java / Groovy : Find XML node by Line number
 * https://stackoverflow.com/questions/47701357/java-groovy-find-xml-node-by-line-number
 *
 * Parsing XML documents partially with StAX
 * https://www.ibm.com/developerworks/library/x-tipstx2/index.html
 *
 * @param args
 * @throws FileNotFoundException
 * @throws XMLStreamException
 * @throws URISyntaxException
 */
public static void main(String[] args) throws FileNotFoundException, XMLStreamException, URISyntaxException {
    printElementsAtLineNumber(53);
}

private static void printElementsAtLineNumber(int lineNumber) throws URISyntaxException, FileNotFoundException, XMLStreamException {
    URL resource = ParsingByLineNumberApplication.class.getClassLoader().getResource("test_file.XML");
    FileReader reader = new FileReader(new File(resource.toURI()));
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader xmlr = factory.createXMLStreamReader(reader);

    // Create a filtered stream reader
    XMLStreamReader xmlfr = factory.createFilteredReader(xmlr, filter);

    // Main event loop
    while (xmlfr.hasNext()) {

        // Process single event
        if (xmlfr.getEventType() == XMLStreamConstants.START_ELEMENT) {
            if (lineNumber == xmlfr.getLocation().getLineNumber()) {
                System.out.println("Character offset: " + xmlfr.getLocation().getCharacterOffset());
                System.out.println("Column number: " + xmlfr.getLocation().getColumnNumber());
                System.out.println("Element name: " + xmlfr.getName().getLocalPart());
                System.out.println("Line number: " + xmlfr.getLocation().getLineNumber());
                // getElementText() throws XMLStreamException if the element has child elements
                System.out.println("Element text: " + xmlfr.getElementText());
            }
        }

        // Move to next event
        xmlfr.next();
    }
}

private static QName[] exclude = new QName[]{
        new QName("invoice"), new QName("item")};

private static StreamFilter filter = new StreamFilter() {
    // Element level
    int depth = -1;
    // Last matching path segment
    int match = -1;
    // Filter result
    boolean process = true;
    // Character position in document
    int currentPos = -1;

    public boolean accept(XMLStreamReader reader) {
        // Get character position
        Location loc = reader.getLocation();
        int pos = loc.getCharacterOffset();
        // Inhibit double execution
        if (pos != currentPos) {
            currentPos = pos;
            switch (reader.getEventType()) {
                case XMLStreamConstants.START_ELEMENT:
                    // Increment element depth
                    if (++depth < exclude.length && match == depth - 1) {
                        // Compare path segment with current element
                        if (reader.getName().equals(exclude[depth]))
                            // Equal - set segment pointer
                            match = depth;
                    }
                    // Process all elements not in path
                    process = match < exclude.length - 1;
                    break;
                // End of XML element
                case XMLStreamConstants.END_ELEMENT:
                    // Process all elements not in path
                    process = match < exclude.length - 1;
                    // Decrement element depth
                    if (--depth < match)
                        // Update segment pointer
                        match = depth;
                    break;
            }
        }
        return process;
    }
};

}

Upvotes: 0

Views: 2186

Answers (1)

Michael Kay

Reputation: 163342

SAX parsers reveal line number information; DOM parsers (and higher level tools such as JAXB) generally don't. I don't know what you want to do with the information once you've found it, but writing your application to use SAX for this sounds like hard work.

If you use Saxon then you have the option of retaining line and column numbers in the constructed tree (Saxon gets the information from the SAX parser and retains it in the tree). For example, you can request this using DocumentBuilder.setLineNumbering() in the s9api interface. If you're using XSLT, XPath, or XQuery then you can get the information using the extension functions saxon:line-number() or saxon:column-number() (requires Saxon-PE or -EE). You can also get the information from a Java application navigating the tree.

Note that the line number and column number returned for an element are as defined in the SAX specification: specifically, the position of the ">" at the end of the start tag. This may not exactly reflect the line and column given in your data file.
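
To illustrate that last point, here is a minimal sketch using only the JDK's built-in SAX parser. It assumes the warning position points at or inside an element's start tag, and that the Locator reports the position just after the closing ">" of that tag, as described above. Under those assumptions, the element you want is the first one whose start tag ends after the reported position. The class name ElementAtPosition and the method find are hypothetical names made up for this example, and the XML is read from a string only to keep the sketch self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of SAX or Saxon.
final class ElementAtPosition extends DefaultHandler {
    private final int targetLine;
    private final int targetColumn;
    private Locator locator;
    private String foundName; // stays null until a matching start tag is seen

    private ElementAtPosition(int targetLine, int targetColumn) {
        this.targetLine = targetLine;
        this.targetColumn = targetColumn;
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        // The parser hands over a Locator before the first event, if it supports one.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (foundName != null || locator == null) {
            return; // already matched, or no position information available
        }
        int line = locator.getLineNumber();
        int column = locator.getColumnNumber();
        // Assumption: the locator reports where the start tag ENDS, so the first
        // start tag that ends after the target position is the one containing it.
        boolean endsAfterTarget =
                line > targetLine || (line == targetLine && column > targetColumn);
        if (endsAfterTarget) {
            foundName = qName;
        }
    }

    /** Returns the qualified name of the matching element, or null if none matches. */
    static String find(String xml, int line, int column) throws Exception {
        ElementAtPosition handler = new ElementAtPosition(line, column);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.foundName;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<invoice>\n  <item>\n    <price>10</price>\n  </item>\n</invoice>";
        // Line 3, column 5 is where the <price> start tag opens in this sample.
        System.out.println(find(xml, 3, 5));
    }
}
```

If the reported position can also fall inside text content or an end tag, this heuristic returns the next element instead, so you would need to track end positions as well. Whether the external party counts columns from 0 or 1, and how it counts tabs, is something to verify against a few known warnings.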

Upvotes: 2
