Piotr

Reputation: 1447

Java REGEX XML parse/cut-down while maintaining structure HowTo

I am writing a RESTful web service in Java. The idea is to "cut down" an XML document and strip away all the unneeded content (~98%) and leave only the tags we're interested in, while maintaining the document's structure, which is as follows (I cannot provide the actual XML content for confidentiality reasons):

<sear:SEGMENTS xmlns="http://www.exlibrisgroup.com/xsd/primo/primo_nm_bib" xmlns:sear="http://www.exlibrisgroup.com/xsd/jaguar/search">
   <sear:JAGROOT>
      <sear:RESULT>
         <sear:DOCSET IS_LOCAL="true" TOTAL_TIME="176" LASTHIT="9" FIRSTHIT="0" TOTALHITS="262" HIT_TIME="11">
            <sear:DOC SEARCH_ENGINE_TYPE="Local Search Engine" SEARCH_ENGINE="Local Search Engine" NO="1" RANK="0.086826384" ID="2347460">
               [
               <PrimoNMBib>
                  <record>
                     <display>
                        <title></title>
                     </display>
                     <sort>
                        <author></author>
                     </sort>
                  </record>
               </PrimoNMBib>
               ]
            </sear:DOC>
         </sear:DOCSET>
      </sear:RESULT>
   </sear:JAGROOT>
</sear:SEGMENTS>

Of course, this is the structure of only the tags we are interested in - there are hundreds more tags, but they are irrelevant.

The square brackets ([]) are not part of the XML. They indicate that <PrimoNMBib></PrimoNMBib> is one element in a list of children and occurs more than once: once per match returned by the search from the RESTful service.

I've been trying to process the document with regular expressions so that only the structure shown above remains, along with the values of <title> and <author>, and everything else between those tags (including other tags) is removed. However, I can't get it to work for the life of me...

Previously I tried using XSLT, but for reasons I never resolved that didn't work either... I've already asked a separate question about the XSLT implementation...

Anyway, I would very much appreciate a tip/hint/solution on how to solve this problem using regex and Java...

Upvotes: 0

Views: 292

Answers (1)

bdoughan

Reputation: 149017

I wouldn't recommend using regex to manipulate XML.

Alternative Approach

You could use a StAX parser that leverages a StreamFilter to cut down the document and still maintain a valid structure.

How a StreamFilter Works

A StreamFilter receives each event from the XMLStreamReader. To have an event reported, return true from accept; otherwise return false. In the example below, the StreamFilter rejects every start and end element in the "http://www.exlibrisgroup.com/xsd/jaguar/search" namespace, and any other event (such as text) reuses the decision made for the most recent element. You will need to tweak the logic to match the requirements of your use case.

Demo

package forum10351473;

import java.io.FileInputStream;
import javax.xml.stream.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        // Pass a stream rather than a Reader so the parser can detect the encoding itself.
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream("src/forum10351473/input.xml"));
        xsr = xif.createFilteredReader(xsr, new StreamFilter() {

            // Decision made for the most recent start/end element; text and
            // other events in between reuse it.
            private boolean reportContent = false;

            @Override
            public boolean accept(XMLStreamReader reader) {
                if(reader.isStartElement() || reader.isEndElement()) {
                    // Report only elements outside the search namespace.
                    reportContent = !"http://www.exlibrisgroup.com/xsd/jaguar/search".equals(reader.getNamespaceURI());
                }
                return reportContent;
            }

        });

        // The XMLStreamReader (xsr) will now only report the events you care about.
        // You can process the XMLStreamReader yourself or pass as input to something
        // like JAXB.
        while(xsr.hasNext()) {
            if(xsr.isStartElement()) {
                System.out.println(xsr.getLocalName());
            }
            xsr.next();
        }
    }

}

Output

PrimoNMBib
record
display
title
sort
author
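
To get closer to what the question asks for (keeping the wrapper elements plus title and author, and writing the result back out as XML), one possible variation is a whitelist-based copy using the StAX event API. This is only a sketch under stated assumptions and not part of the demo above: the class name CutDown, the method cutDown, and the KEEP and TEXT sets are hypothetical names, and the element names are taken from the structure shown in the question. It uses a plain event loop rather than a StreamFilter, and it matches on local names only, so a title element anywhere under a kept parent would also survive. It needs Java 9 or later for Set.of.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies only whitelisted elements to the output.
final class CutDown {

    // Assumption: local names of the elements to keep, from the question's structure.
    static final Set<String> KEEP = Set.of("SEGMENTS", "JAGROOT", "RESULT", "DOCSET", "DOC",
            "PrimoNMBib", "record", "display", "title", "sort", "author");

    // Assumption: only these elements have their text content copied.
    static final Set<String> TEXT = Set.of("title", "author");

    static String cutDown(String xml) {
        try {
            XMLEventReader in = XMLInputFactory.newFactory()
                    .createXMLEventReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(out);

            Deque<String> open = new ArrayDeque<>(); // kept elements currently open
            int skipDepth = 0;                       // > 0 while inside a dropped subtree

            while (in.hasNext()) {
                XMLEvent event = in.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    if (skipDepth > 0 || !KEEP.contains(name)) {
                        skipDepth++;                 // drop this element and its descendants
                    } else {
                        open.push(name);
                        writer.add(event);           // copies namespaces and attributes too
                    }
                } else if (event.isEndElement()) {
                    if (skipDepth > 0) {
                        skipDepth--;
                    } else {
                        open.pop();
                        writer.add(event);
                    }
                } else if (event.isCharacters()) {
                    // Copy text only when directly inside <title> or <author>.
                    if (skipDepth == 0 && !open.isEmpty() && TEXT.contains(open.peek())) {
                        writer.add(event);
                    }
                }
                // Comments, processing instructions and document events are dropped.
            }
            writer.flush();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<s:SEGMENTS xmlns:s='urn:example:search'><s:DOC ID='1'>"
                + "<junk>drop me</junk><title>Kept</title></s:DOC></s:SEGMENTS>";
        System.out.println(cutDown(sample));
    }
}
```

Because a dropped element takes its whole subtree with it, every start tag that is written gets its matching end tag, so the output stays well-formed. If the same local name can appear in places you don't want, matching on the full path from the root (or on the QName including the namespace) would be a safer test than the local name alone.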

Upvotes: 1
